(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(19) World Intellectual Property Organization 

International Bureau

(43) International Publication Date 
11 December 2003 (11.12.2003) 




(10) International Publication Number 

WO 03/102921 A1



(51) International Patent Classification 7 : G10L 19/00 

(21) International Application Number: PCT/CA03/00830 

(22) International Filing Date: 30 May 2003 (30.05.2003) 

(25) Filing Language: English 

(26) Publication Language: English 



(30) Priority Data: 

2,388,439    31 May 2002 (31.05.2002)    CA



(71) Applicant (for all designated States except US):
VOICEAGE CORPORATION [CA/CA]; 750 chemin
Lucerne, Suite 250, Ville Mont-Royal, Quebec H3R 2H6
(CA).

(72) Inventors; and

(75) Inventors/Applicants (for US only): JELINEK, Milan
[CA/CA]; 925, rue Walton, Sherbrooke, Quebec J1H 1K4
(CA). GOURNAY, Philippe [CA/CA]; 855 rue du Mont
Brome, Sherbrooke, Quebec J1L 2V9 (CA).

(74) Agents: BROUILLETTE, Robert et al.; Brouillette
Kosie Prince, 1100 René-Lévesque Blvd. West, 25th Floor,
Montreal, Quebec H3B 5C9 (CA).



(81) Designated States (national): AE, AG, AL, AM, AT, AU,
AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CO, CR, CU,
CZ, DE, DK, DM, DZ, EC, EE, ES, FI, GB, GD, GE, GH,
GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC,
LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW,
MX, MZ, NI, NO, NZ, OM, PH, PL, PT, RO, RU, SC, SD,
SE, SG, SK, SL, TJ, TM, TN, TR, TT, TZ, UA, UG, US,
UZ, VC, VN, YU, ZA, ZM, ZW.

(84) Designated States (regional): ARIPO patent (GH, GM,
KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZM, ZW),
Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM),
European patent (AT, BE, BG, CH, CY, CZ, DE, DK, EE,
ES, FI, FR, GB, GR, HU, IE, IT, LU, MC, NL, PT, RO,
SE, SI, SK, TR), OAPI patent (BF, BJ, CF, CG, CI, CM,
GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).

Published: 

- with international search report

For two-letter codes and other abbreviations, refer to the "Guidance Notes on
Codes and Abbreviations" appearing at the beginning of each regular issue of
the PCT Gazette.



(54) Title: METHOD AND DEVICE FOR EFFICIENT FRAME ERASURE CONCEALMENT IN LINEAR PREDICTIVE 
BASED SPEECH CODECS 



[Figure 1: block diagram of the speech communication system 100, showing the A/D converter 104, speech encoder 106, channel encoder 108, communication channel 101, channel decoder 109, speech decoder 110 and D/A converter 115]



(57) Abstract: The present invention relates to a method and device for improving concealment of frame erasure caused by frames
of an encoded sound signal erased during transmission from an encoder (106) to a decoder (110), and for accelerating recovery of
the decoder after non erased frames of the encoded sound signal have been received. For that purpose, concealment/recovery
parameters are determined in the encoder or decoder. When determined in the encoder (106), the concealment/recovery parameters
are transmitted to the decoder (110). In the decoder, erasure frame concealment and decoder recovery are conducted in response to
the concealment/recovery parameters. The concealment/recovery parameters may be selected from the group consisting of: a signal
classification parameter, an energy information parameter and a phase information parameter. The determination of the
concealment/recovery parameters comprises classifying the successive frames of the encoded sound signal as unvoiced, unvoiced transition,
voiced transition, voiced, or onset, and this classification is determined on the basis of at least a part of the following parameters:
a normalized correlation parameter, a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability parameter, a relative
frame energy parameter, and a zero crossing parameter.

METHOD AND DEVICE FOR EFFICIENT FRAME ERASURE
CONCEALMENT IN LINEAR PREDICTIVE BASED SPEECH CODECS

FIELD OF THE INVENTION 

The present invention relates to a technique for digitally encoding a
sound signal, in particular but not exclusively a speech signal, with a view to
transmitting and/or synthesizing this sound signal. More specifically, the present
invention relates to robust encoding and decoding of sound signals to maintain
good performance in case of erased frame(s) due, for example, to channel errors
in wireless systems or lost packets in voice over packet network applications.

BACKGROUND OF THE INVENTION 

The demand for efficient digital narrow- and wideband speech encoding
techniques with a good trade-off between the subjective quality and bit rate is
increasing in various application areas such as teleconferencing, multimedia, and
wireless communications. Until recently, a telephone bandwidth limited to
the range of 200-3400 Hz has mainly been used in speech coding applications.
However, wideband speech applications provide increased intelligibility and
naturalness in communication compared to the conventional telephone
bandwidth. A bandwidth in the range of 50-7000 Hz has been found sufficient for
delivering a good quality giving an impression of face-to-face communication. For
general audio signals, this bandwidth gives an acceptable subjective quality, but
is still lower than the quality of FM radio or CD, which operate on ranges of
20-16000 Hz and 20-20000 Hz, respectively.

A speech encoder converts a speech signal into a digital bit stream which
is transmitted over a communication channel or stored in a storage medium. The
speech signal is digitized, that is, sampled and quantized, usually with 16 bits per
sample. The speech encoder has the role of representing these digital samples
with a smaller number of bits while maintaining a good subjective speech quality.
The speech decoder or synthesizer operates on the transmitted or stored bit
stream and converts it back to a sound signal.

Code-Excited Linear Prediction (CELP) coding is one of the best
available techniques for achieving a good compromise between the subjective
quality and bit rate. This encoding technique is a basis of several speech
encoding standards both in wireless and wireline applications. In CELP encoding,
the sampled speech signal is processed in successive blocks of L samples
usually called frames, where L is a predetermined number corresponding typically
to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every
frame. The computation of the LP filter typically needs a lookahead, a 5-15 ms
speech segment from the subsequent frame. The L-sample frame is divided into
smaller blocks called subframes. Usually the number of subframes is three or four
resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually
obtained from two components, the past excitation and the innovative,
fixed-codebook excitation. The component formed from the past excitation is often
referred to as the adaptive codebook or pitch excitation. The parameters
characterizing the excitation signal are coded and transmitted to the decoder,
where the reconstructed excitation signal is used as the input of the LP filter.

Because the main applications of low bit rate speech encoding are wireless
mobile communication systems and voice over packet networks, increasing
the robustness of speech codecs in case of frame erasures is of
significant importance. In wireless cellular systems, the energy of the received
signal can exhibit frequent severe fades resulting in high bit error rates, and this
becomes more evident at the cell boundaries. In this case the channel decoder
fails to correct the errors in the received frame and, as a consequence, the error
detector usually used after the channel decoder will declare the frame as erased.
In voice over packet network applications, the speech signal is packetized, where
usually a 20 ms frame is placed in each packet. In packet-switched
communications, a packet can be dropped at a router if the number of packets
becomes very large, or the packet can reach the receiver after a long delay, in which
case it should be declared as lost if its delay is more than the length of a jitter buffer at
the receiver side. In these systems, the codec is typically subjected to 3 to 5%
frame erasure rates. Furthermore, the use of wideband speech encoding is an
important asset to these systems in order to allow them to compete with the
traditional PSTN (public switched telephone network), which uses the legacy
narrowband speech signals.

The adaptive codebook, or the pitch predictor, in CELP plays an
important role in maintaining high speech quality at low bit rates. However, since
the content of the adaptive codebook is based on the signal from past frames,
this makes the codec model sensitive to frame loss. In case of erased or lost
frames, the content of the adaptive codebook at the decoder becomes different
from its content at the encoder. Thus, after a lost frame is concealed and
subsequent good frames are received, the synthesized signal in the received
good frames is different from the intended synthesis signal since the adaptive
codebook contribution has been changed. The impact of a lost frame depends on
the nature of the speech segment in which the erasure occurred. If the erasure
occurs in a stationary segment of the signal, then an efficient frame erasure
concealment can be performed and the impact on subsequent good frames can
be minimized. On the other hand, if the erasure occurs in a speech onset or a
transition, the effect of the erasure can propagate through several frames. For
instance, if the beginning of a voiced segment is lost, then the first pitch period
will be missing from the adaptive codebook content. This will have a severe effect
on the pitch predictor in subsequent good frames, so that it takes a long time before
the synthesis signal converges to the one intended at the encoder.

SUMMARY OF THE INVENTION

The present invention relates to a method for improving concealment of
frame erasure caused by frames of an encoded sound signal erased during
transmission from an encoder to a decoder, and for accelerating recovery of the
decoder after non erased frames of the encoded sound signal have been
received, comprising:
determining, in the encoder, concealment/recovery parameters;

transmitting to the decoder the concealment/recovery parameters
determined in the encoder; and

in the decoder, conducting erasure frame concealment and decoder
recovery in response to the received concealment/recovery parameters.

The present invention also relates to a method for the concealment of
frame erasure caused by frames erased during transmission of a sound signal
encoded in the form of signal-encoding parameters from an encoder to a
decoder, and for accelerating recovery of the decoder after non erased frames of
the encoded sound signal have been received, comprising:

determining, in the decoder, concealment/recovery parameters from the
signal-encoding parameters; and

in the decoder, conducting erased frame concealment and decoder
recovery in response to the determined concealment/recovery parameters.

In accordance with the present invention, there is also provided a device
for improving concealment of frame erasure caused by frames of an encoded
sound signal erased during transmission from an encoder to a decoder, and for
accelerating recovery of the decoder after non erased frames of the encoded
sound signal have been received, comprising:

means for determining, in the encoder, concealment/recovery parameters;

means for transmitting to the decoder the concealment/recovery
parameters determined in the encoder; and

in the decoder, means for conducting erasure frame concealment and
decoder recovery in response to the received concealment/recovery parameters.

According to the invention, there is further provided a device for the
concealment of frame erasure caused by frames erased during transmission of a
sound signal encoded in the form of signal-encoding parameters from an
encoder to a decoder, and for accelerating recovery of the decoder after non
erased frames of the encoded sound signal have been received, comprising:

means for determining, in the decoder, concealment/recovery parameters
from the signal-encoding parameters; and

in the decoder, means for conducting erased frame concealment and
decoder recovery in response to the determined concealment/recovery
parameters.

The present invention is also concerned with a system for encoding and
decoding a sound signal, and a sound signal decoder using the above defined
devices for improving concealment of frame erasure caused by frames of the
encoded sound signal erased during transmission from the encoder to the
decoder, and for accelerating recovery of the decoder after non erased frames of
the encoded sound signal have been received.

The foregoing and other objects, advantages and features of the present
invention will become more apparent upon reading of the following non restrictive
description of illustrative embodiments thereof, given by way of example only with
reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a schematic block diagram of a speech communication
system illustrating an application of speech encoding and decoding devices in
accordance with the present invention;

Figure 2 is a schematic block diagram of an example of wideband
encoding device (AMR-WB encoder);

Figure 3 is a schematic block diagram of an example of wideband
decoding device (AMR-WB decoder);

Figure 4 is a simplified block diagram of the AMR-WB encoder of Figure
2, wherein the down-sampler module, the high-pass filter module and the
pre-emphasis filter module have been grouped in a single pre-processing module,
and wherein the closed-loop pitch search module, the zero-input response
calculator module, the impulse response generator module, the innovative
excitation search module and the memory update module have been grouped in
a single closed-loop pitch and innovative codebook search module;

Figure 5 is an extension of the block diagram of Figure 4 in which
modules related to an illustrative embodiment of the present invention have been
added;

Figure 6 is a block diagram explaining the situation when an artificial
onset is constructed; and

Figure 7 is a schematic diagram showing an illustrative embodiment of a
frame classification state machine for the erasure concealment.

DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS

Although the illustrative embodiments of the present invention will be
described in the following description in relation to a speech signal, it should be
kept in mind that the concepts of the present invention equally apply to other
types of signal, in particular but not exclusively to other types of sound signals.

Figure 1 illustrates a speech communication system 100 depicting the
use of speech encoding and decoding in the context of the present invention. The
speech communication system 100 of Figure 1 supports transmission of a speech
signal across a communication channel 101. Although it may comprise for
example a wire, an optical link or a fiber link, the communication channel 101
typically comprises at least in part a radio frequency link. The radio frequency link
often supports multiple, simultaneous speech communications requiring shared
bandwidth resources such as may be found with cellular telephony systems.
Although not shown, the communication channel 101 may be replaced by a
storage device in a single device embodiment of the system 100 that records and
stores the encoded speech signal for later playback.

In the speech communication system 100 of Figure 1, a microphone 102
produces an analog speech signal 103 that is supplied to an analog-to-digital
(A/D) converter 104 for converting it into a digital speech signal 105. A speech
encoder 106 encodes the digital speech signal 105 to produce a set of
signal-encoding parameters 107 that are coded into binary form and delivered to a
channel encoder 108. The optional channel encoder 108 adds redundancy to the
binary representation of the signal-encoding parameters 107 before transmitting
them over the communication channel 101.

In the receiver, a channel decoder 109 uses this redundant
information in the received bit stream 111 to detect and correct channel errors
that occurred during the transmission. A speech decoder 110 converts the bit
stream 112 received from the channel decoder 109 back to a set of
signal-encoding parameters and creates from the recovered signal-encoding parameters
a digital synthesized speech signal 113. The digital synthesized speech signal
113 reconstructed at the speech decoder 110 is converted to an analog form 114
by a digital-to-analog (D/A) converter 115 and played back through a loudspeaker
unit 116.

The illustrative embodiment of efficient frame erasure concealment
method disclosed in the present specification can be used with either narrowband
or wideband linear prediction based codecs. The present illustrative embodiment
is disclosed in relation to a wideband speech codec that has been standardized
by the International Telecommunication Union (ITU) as Recommendation
G.722.2 and known as the AMR-WB codec (Adaptive Multi-Rate Wideband
codec) [ITU-T Recommendation G.722.2 "Wideband coding of speech at around
16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)", Geneva, 2002]. This
codec has also been selected by the third generation partnership project (3GPP)
for wideband telephony in third generation wireless systems [3GPP TS 26.190,
"AMR Wideband Speech Codec: Transcoding Functions," 3GPP Technical
Specification]. AMR-WB can operate at 9 bit rates ranging from 6.6 to 23.85
kbit/s. The bit rate of 12.65 kbit/s is used to illustrate the present invention.

Here, it should be understood that the illustrative embodiment of efficient 
frame erasure concealment method could be applied to other types of codecs. 

In the following sections, an overview of the AMR-WB encoder and
decoder will be first given. Then, the illustrative embodiment of the novel
approach to improve the robustness of the codec will be disclosed.

Overview of the AMR-WB encoder

The sampled speech signal is encoded on a block by block basis by the
encoding device 200 of Figure 2 which is broken down into eleven modules
numbered from 201 to 211.

The input speech signal 212 is therefore processed on a block-by-block
basis, i.e. in the above-mentioned L-sample blocks called frames.

Referring to Figure 2, the sampled input speech signal 212 is
down-sampled in a down-sampler module 201. The signal is down-sampled from 16
kHz down to 12.8 kHz, using techniques well known to those of ordinary skill in
the art. Down-sampling increases the coding efficiency, since a smaller frequency
bandwidth is encoded. This also reduces the algorithmic complexity since the
number of samples in a frame is decreased. After down-sampling, the
320-sample frame of 20 ms is reduced to a 256-sample frame (down-sampling ratio of
4/5).
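By way of a non-limiting illustration, the frame sizes quoted above follow directly from the two sampling rates and the 20 ms frame duration. The short sketch below only checks that arithmetic; the helper name `frame_length` and the constant names are hypothetical and are not part of the codec.

```python
from fractions import Fraction


def frame_length(sample_rate_hz: int, frame_ms: int) -> int:
    """Return the number of samples in a frame lasting frame_ms milliseconds."""
    return sample_rate_hz * frame_ms // 1000


# Rates and frame duration quoted in the text above.
INPUT_RATE_HZ = 16000
INTERNAL_RATE_HZ = 12800
FRAME_MS = 20

ratio = Fraction(INTERNAL_RATE_HZ, INPUT_RATE_HZ)      # down-sampling ratio
input_len = frame_length(INPUT_RATE_HZ, FRAME_MS)      # samples before down-sampling
output_len = frame_length(INTERNAL_RATE_HZ, FRAME_MS)  # samples after down-sampling

print(ratio, input_len, output_len)
```

The ratio is kept as an exact fraction so that multiplying the input frame length by it gives the output frame length without rounding.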

The input frame is then supplied to the optional pre-processing module
202. Pre-processing module 202 may consist of a high-pass filter with a 50 Hz
cut-off frequency. High-pass filter 202 removes the unwanted sound components
below 50 Hz.

The down-sampled, pre-processed signal is denoted by $s_p(n)$, $n = 0, 1, 2, \ldots, L-1$,
where L is the length of the frame (256 at a sampling frequency of 12.8
kHz). In an illustrative embodiment of the preemphasis filter 203, the signal $s_p(n)$
is preemphasized using a filter having the following transfer function:

$$P(z) = 1 - \mu z^{-1}$$

where $\mu$ is a preemphasis factor with a value located between 0 and 1 (a typical
value is $\mu = 0.7$). The function of the preemphasis filter 203 is to enhance the high
frequency contents of the input speech signal. It also reduces the dynamic range
of the input speech signal, which renders it more suitable for fixed-point
implementation. Preemphasis also plays an important role in achieving a proper
overall perceptual weighting of the quantization error, which contributes to
improved sound quality. This will be explained in more detail herein below.
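As a hedged illustration of the preemphasis filter, a first-order filter of the form P(z) = 1 - μz⁻¹ corresponds in the time domain to s(n) = s_p(n) - μ·s_p(n-1). The sketch below assumes a zero initial filter memory by default; the function name `preemphasize` and the `prev` argument are hypothetical and chosen only for this example.

```python
def preemphasize(s_p, mu=0.7, prev=0.0):
    """Apply P(z) = 1 - mu * z^-1, i.e. s[n] = s_p[n] - mu * s_p[n-1].

    prev is the last input sample of the previous frame (assumed 0.0 here).
    """
    out = []
    for x in s_p:
        out.append(x - mu * prev)
        prev = x
    return out


# A rapidly alternating (high-frequency) input is amplified,
# while a constant (low-frequency) input is attenuated.
print(preemphasize([1.0, -1.0, 1.0, -1.0], mu=0.5))
print(preemphasize([1.0, 1.0, 1.0], mu=0.5))
```

In a frame-based implementation, the last input sample of each frame would be carried over as `prev` for the next frame so that the filter memory is continuous across frame boundaries.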

The output of the preemphasis filter 203 is denoted s(n). This signal is
used for performing LP analysis in module 204. LP analysis is a technique well
known to those of ordinary skill in the art. In this illustrative implementation, the
autocorrelation approach is used. In the autocorrelation approach, the signal s(n)
is first windowed using, typically, a Hamming window having a length of the order
of 30-40 ms. The autocorrelations are computed from the windowed signal, and
Levinson-Durbin recursion is used to compute LP filter coefficients $a_i$, where
$i = 1, \ldots, p$, and where p is the LP order, which is typically 16 in wideband coding.
The parameters $a_i$ are the coefficients of the transfer function A(z) of the LP filter,
which is given by the following relation:

$$A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$$
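The autocorrelation approach can be sketched as follows. This is a minimal textbook version given for illustration only: it assumes a symmetric Hamming window and a plain floating-point Levinson-Durbin recursion, and does not reproduce the window shape, lag windowing or fixed-point details of any standardized codec. The function names are hypothetical. The sign convention matches A(z) = 1 + Σ aᵢz⁻ⁱ.

```python
import math


def hamming(n_len):
    """Symmetric Hamming window of n_len samples (textbook definition)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]


def autocorrelation(x, max_lag):
    """Autocorrelations r[0..max_lag] of the (already windowed) signal x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return ([1, a_1, ..., a_order], prediction error energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


# Autocorrelations of an ideal first-order process with correlation 0.5:
# the recursion should find a_1 = -0.5 and a_2 = 0.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], 2)
print(coeffs, residual)
```

In practice the frame would first be multiplied sample by sample by the window, then passed to `autocorrelation` with `max_lag` equal to the LP order, and the result passed to `levinson_durbin`.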

LP analysis is performed in module 204, which also performs the
quantization and interpolation of the LP filter coefficients. The LP filter coefficients
are first transformed into another equivalent domain more suitable for
quantization and interpolation purposes. The line spectral pair (LSP) and
immittance spectral pair (ISP) domains are two domains in which quantization and
interpolation can be efficiently performed. The 16 LP filter coefficients, $a_i$, can be
quantized in the order of 30 to 50 bits using split or multi-stage quantization, or a
combination thereof. The purpose of the interpolation is to enable updating the LP
filter coefficients every subframe while transmitting them once every frame, which
improves the encoder performance without increasing the bit rate. Quantization
and interpolation of the LP filter coefficients are believed to be otherwise well
known to those of ordinary skill in the art and, accordingly, will not be further
described in the present specification.

The following paragraphs will describe the rest of the coding operations
performed on a subframe basis. In this illustrative implementation, the input frame
is divided into 4 subframes of 5 ms (64 samples at the sampling frequency of
12.8 kHz). In the following description, the filter A(z) denotes the unquantized
interpolated LP filter of the subframe, and the filter $\hat{A}(z)$ denotes the quantized
interpolated LP filter of the subframe. The filter $\hat{A}(z)$ is supplied every subframe to
a multiplexer 213 for transmission through a communication channel.

In analysis-by-synthesis encoders, the optimum pitch and innovation
parameters are searched by minimizing the mean squared error between the
input speech signal 212 and a synthesized speech signal in a perceptually
weighted domain. The weighted signal $s_w(n)$ is computed in a perceptual
weighting filter 205 in response to the signal s(n) from the pre-emphasis filter
203. A perceptual weighting filter 205 with fixed denominator, suited for wideband
signals, is used. An example of a transfer function for the perceptual weighting filter
205 is given by the following relation:

$$W(z) = A(z/\gamma_1) / (1 - \gamma_2 z^{-1}), \quad \text{where } 0 < \gamma_2 < \gamma_1 \le 1$$
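For illustration, a weighting filter of the form W(z) = A(z/γ₁)/(1 - γ₂z⁻¹) can be applied by scaling each LP coefficient aᵢ by γ₁ⁱ to obtain the numerator, and by running a one-pole recursion for the denominator. The sketch below assumes zero initial filter states, and its default values γ₁ = 0.92 and γ₂ = 0.68 are illustrative assumptions not taken from the present text; the function names are hypothetical.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): each a_i is scaled by gamma**i (a[0] = 1)."""
    return [c * gamma ** i for i, c in enumerate(a)]


def perceptual_weighting(s, a, gamma1=0.92, gamma2=0.68):
    """Filter s through W(z) = A(z/gamma1) / (1 - gamma2 * z^-1).

    a is [1, a_1, ..., a_p]; filter memories are assumed to start at zero.
    """
    num = bandwidth_expand(a, gamma1)
    out = []
    prev_y = 0.0
    for n in range(len(s)):
        # FIR part: numerator A(z/gamma1) applied to the input.
        acc = sum(num[i] * s[n - i] for i in range(len(num)) if n - i >= 0)
        # IIR part: fixed one-pole denominator 1 / (1 - gamma2 * z^-1).
        y = acc + gamma2 * prev_y
        out.append(y)
        prev_y = y
    return out


# Impulse response with a trivial LP filter A(z) = 1: a decaying exponential.
print(perceptual_weighting([1.0, 0.0, 0.0], [1.0], gamma1=0.9, gamma2=0.5))
```

Because the denominator does not depend on the LP coefficients, only the numerator changes from subframe to subframe, which is what is meant by a fixed denominator.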

In order to simplify the pitch analysis, an open-loop pitch lag $T_{OL}$ is first
estimated in an open-loop pitch search module 206 from the weighted speech
signal $s_w(n)$. Then the closed-loop pitch analysis, which is performed in a
closed-loop pitch search module 207 on a subframe basis, is restricted around the
open-loop pitch lag $T_{OL}$, which significantly reduces the search complexity of the LTP
parameters T (pitch lag) and b (pitch gain). The open-loop pitch analysis is
usually performed in module 206 once every 10 ms (two subframes) using
techniques well known to those of ordinary skill in the art.

The target vector x for LTP (Long Term Prediction) analysis is first
computed. This is usually done by subtracting the zero-input response $s_0$ of the
weighted synthesis filter $W(z)/\hat{A}(z)$ from the weighted speech signal $s_w(n)$. This
zero-input response $s_0$ is calculated by a zero-input response calculator 208 in
response to the quantized interpolated LP filter $\hat{A}(z)$ from the LP analysis,
quantization and interpolation module 204 and to the initial states of the weighted
synthesis filter $W(z)/\hat{A}(z)$ stored in memory update module 211 in response to the
LP filters A(z) and $\hat{A}(z)$, and the excitation vector u. This operation is well known
to those of ordinary skill in the art and, accordingly, will not be further described.



An N-dimensional impulse response vector h of the weighted synthesis filter
$W(z)/\hat{A}(z)$ is computed in the impulse response generator 209 using the
coefficients of the LP filters A(z) and $\hat{A}(z)$ from module 204. Again, this operation
is well known to those of ordinary skill in the art and, accordingly, will not be
further described in the present specification.

The closed-loop pitch (or pitch codebook) parameters b, T and j are
computed in the closed-loop pitch search module 207, which uses the target
vector x, the impulse response vector h and the open-loop pitch lag $T_{OL}$ as
inputs.

The pitch search consists of finding the best pitch lag T and gain b that
minimize a mean squared weighted pitch prediction error, for example

$$e = \| x - b\,y_T \|^2$$

between the target vector x and a scaled filtered version $b\,y_T$ of the past excitation.

More specifically, in the present illustrative implementation, the pitch (pitch
codebook) search is composed of three stages.

In the first stage, an open-loop pitch lag $T_{OL}$ is estimated in the open-loop
pitch search module 206 in response to the weighted speech signal $s_w(n)$. As
indicated in the foregoing description, this open-loop pitch analysis is usually
performed once every 10 ms (two subframes) using techniques well known to
those of ordinary skill in the art.

In the second stage, a search criterion C is searched in the closed-loop
pitch search module 207 for integer pitch lags around the estimated open-loop
pitch lag $T_{OL}$ (usually ±5), which significantly simplifies the search procedure. A
simple procedure is used for updating the filtered codevector $y_T$ (this vector is
defined in the following description) without the need to compute the convolution
for every pitch lag. An example of search criterion C is given by:

$$C = \frac{x^t y_T}{\sqrt{y_T^t y_T}}$$ where t denotes vector transpose
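The second stage can be sketched as an exhaustive search over integer lags that maximizes a criterion of the form C = xᵗy_T / √(y_Tᵗy_T). The sketch below is a simplified assumption-laden illustration: it recomputes the full convolution for every lag rather than using the recursive update mentioned above, it extends the past excitation periodically when the lag is shorter than the subframe, and all function names are hypothetical.

```python
import math


def pitch_codevector(past_exc, lag, n_len):
    """Past excitation delayed by lag; repeated periodically if lag < n_len."""
    v = []
    for n in range(n_len):
        v.append(past_exc[-lag + n] if n < lag else v[n - lag])
    return v


def convolve_truncated(v, h):
    """First len(v) samples of the convolution of v with impulse response h."""
    return [sum(v[i] * h[n - i] for i in range(n + 1) if n - i < len(h))
            for n in range(len(v))]


def search_integer_lag(x, past_exc, h, t_ol, delta=5):
    """Return (lag, C) maximizing C = x.y / sqrt(y.y) for lags within t_ol +/- delta."""
    best_lag, best_c = None, -math.inf
    for lag in range(max(1, t_ol - delta), t_ol + delta + 1):
        y = convolve_truncated(pitch_codevector(past_exc, lag, len(x)), h)
        energy = sum(val * val for val in y)
        if energy <= 0.0:
            continue  # silent candidate, criterion undefined
        c = sum(xi * yi for xi, yi in zip(x, y)) / math.sqrt(energy)
        if c > best_c:
            best_lag, best_c = lag, c
    return best_lag, best_c


# Toy example: a pulse train of period 7, an identity impulse response,
# and a target that simply continues the pulse train.
past = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] * 4
target = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(search_integer_lag(target, past, [1.0], t_ol=8))
```

Once the lag is selected, a gain minimizing the squared error for that lag would be the ratio xᵗy_T / y_Tᵗy_T, before any quantization.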



Once an optimum integer pitch lag is found in the second stage, a third
stage of the search (module 207) tests, by means of the search criterion C, the
fractions around that optimum integer pitch lag. For example, the AMR-WB
standard uses 1/4 and 1/2 subsample resolution.

In wideband signals, the harmonic structure exists only up to a certain
frequency, depending on the speech segment. Thus, in order to achieve efficient
representation of the pitch contribution in voiced segments of a wideband speech
signal, flexibility is needed to vary the amount of periodicity over the wideband
spectrum. This is achieved by processing the pitch codevector through a plurality
of frequency shaping filters (for example low-pass or band-pass filters), and the
frequency shaping filter that minimizes the mean-squared weighted error $e^{(j)}$ is
selected. The selected frequency shaping filter is identified by an index j.

The pitch codebook index T is encoded and transmitted to the multiplexer
213 for transmission through a communication channel. The pitch gain b is
quantized and transmitted to the multiplexer 213. An extra bit is used to encode
the index j, this extra bit being also supplied to the multiplexer 213.

Once the pitch, or LTP (Long Term Prediction) parameters b, T, and j are
determined, the next step is to search for the optimum innovative excitation by
means of the innovative excitation search module 210 of Figure 2. First, the
target vector x is updated by subtracting the LTP contribution:

$$x' = x - b\,y_T$$

where b is the pitch gain and $y_T$ is the filtered pitch codebook vector (the past
excitation at delay T filtered with the selected frequency shaping filter (index j)
and convolved with the impulse response h).

The innovative excitation search procedure in CELP is performed in an
innovation codebook to find the optimum excitation codevector $c_k$ and gain g
which minimize the mean-squared error E between the target vector x' and a
scaled filtered version of the codevector $c_k$, for example:

$$E = \| x' - g H c_k \|^2$$

where H is a lower triangular convolution matrix derived from the impulse
response vector h. The index k of the innovation codebook corresponding to the
found optimum codevector and the gain g are supplied to the multiplexer 213
for transmission through a communication channel.
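The target update and the minimization of E = ||x' - gHc_k||² can be illustrated with a brute-force search over a small toy codebook. This sketch is not the algebraic codebook search of the cited patents, which uses a structured codebook and fast search procedures; it only shows the quantities involved, under the assumption that the gain is left unquantized. All function names and the toy codebook are hypothetical.

```python
def update_target(x, b, y_t):
    """x' = x - b * y_T: remove the pitch (LTP) contribution from the target."""
    return [xi - b * yi for xi, yi in zip(x, y_t)]


def convolution_matrix(h, n_len):
    """Lower triangular Toeplitz matrix H with H[n][i] = h[n - i] for n >= i."""
    return [[h[n - i] if 0 <= n - i < len(h) else 0.0 for i in range(n_len)]
            for n in range(n_len)]


def mat_vec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def search_innovation(x_prime, h, codebook):
    """Return (k, g, E) minimizing E = ||x' - g * H * c_k||^2 over the codebook."""
    big_h = convolution_matrix(h, len(x_prime))
    best = None
    for k, c in enumerate(codebook):
        z = mat_vec(big_h, c)                      # filtered codevector H c_k
        zz = sum(val * val for val in z)
        if zz == 0.0:
            continue
        g = sum(xi * zi for xi, zi in zip(x_prime, z)) / zz   # optimal gain
        err = sum((xi - g * zi) ** 2 for xi, zi in zip(x_prime, z))
        if best is None or err < best[2]:
            best = (k, g, err)
    return best


# Toy codebook of four codevectors and a short impulse response.
toy_codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0], [1.0, 0.0, -1.0, 0.0]]
print(search_innovation([0.0, 2.0, 1.0, 0.0], [1.0, 0.5], toy_codebook))
```

For each candidate, the gain is the least-squares optimum for that codevector, so comparing the resulting errors is equivalent to maximizing (x'ᵗHc_k)² / ||Hc_k||².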

It should be noted that the innovation codebook used is a dynamic
codebook consisting of an algebraic codebook followed by an adaptive pre-filter
F(z) which enhances special spectral components in order to improve the
synthesis speech quality, according to US Patent 5,444,816 granted to Adoul et
al. on August 22, 1995. In this illustrative implementation, the innovative
codebook search is performed in module 210 by means of an algebraic codebook
as described in US patents Nos: 5,444,816 (Adoul et al.) issued on August 22,
1995; 5,699,482 granted to Adoul et al., on December 17, 1997; 5,754,976
granted to Adoul et al., on May 19, 1998; and 5,701,392 (Adoul et al.) dated
December 23, 1997.

Overview of AMR-WB Decoder

The speech decoder 300 of Figure 3 illustrates the various steps carried 
out between the digital input 322 (input bit stream to the demultiplexer 317) and 
the output sampled speech signal 323 (output of the adder 321). 

Demultiplexer 317 extracts the synthesis model parameters from the
binary information (input bit stream 322) received from a digital input channel.
From each received binary frame, the extracted parameters are:

• the quantized, interpolated LP coefficients $\hat{A}(z)$, also called
short-term prediction parameters (STP), produced once per frame;

• the long-term prediction (LTP) parameters T, b, and j (for each
subframe); and

• the innovation codebook index k and gain g (for each
subframe).

The current speech signal is synthesized based on these parameters as
will be explained hereinbelow.

The innovation codebook 318 is responsive to the index k to produce the
innovation codevector $c_k$, which is scaled by the decoded gain factor g through
an amplifier 324. In the illustrative implementation, an innovation codebook as
described in the above mentioned US patent numbers 5,444,816; 5,699,482;
5,754,976; and 5,701,392 is used to produce the innovative codevector $c_k$.

The generated scaled codevector at the output of the amplifier 324 is 
processed through a frequency-dependent pitch enhancer 305. 

Enhancing the periodicity of the excitation signal u improves the quality of
voiced segments. The periodicity enhancement is achieved by filtering the
innovative codevector $c_k$ from the innovation (fixed) codebook through an
innovation filter F(z) (pitch enhancer 305) whose frequency response emphasizes
the higher frequencies more than the lower frequencies. The coefficients of the
innovation filter F(z) are related to the amount of periodicity in the excitation
signal u.

An efficient, illustrative way to derive the coefficients of the innovation filter
F(z) is to relate them to the amount of pitch contribution in the total excitation
signal u. This results in a frequency response depending on the subframe
periodicity, where higher frequencies are more strongly emphasized (stronger
overall slope) for higher pitch gains. The innovation filter 305 has the effect of
lowering the energy of the innovation codevector at lower frequencies when
the excitation signal u is more periodic, which enhances the periodicity of the
excitation signal u at lower frequencies more than higher frequencies. A
suggested form for the innovation filter 305 is the following:

$$F(z) = -\alpha z + 1 - \alpha z^{-1}$$

where α is a periodicity factor derived from the level of periodicity of the excitation
signal u. The periodicity factor α is computed in the voicing factor generator 304.
First, a voicing factor $r_v$ is computed in voicing factor generator 304 by:

$$r_v = (E_v - E_c)/(E_v + E_c)$$

where $E_v$ is the energy of the scaled pitch codevector $bv_T$ and $E_c$ is the energy
of the scaled innovative codevector $gc_k$. That is:

$$E_v = b^2 v_T^t v_T = b^2 \sum_{n=0}^{N-1} v_T^2(n)$$

and

$$E_c = g^2 c_k^t c_k = g^2 \sum_{n=0}^{N-1} c_k^2(n)$$

Note that the value of $r_v$ lies between -1 and 1 (1 corresponds to purely voiced
signals and -1 corresponds to purely unvoiced signals).

The above mentioned scaled pitch codevector $bv_T$ is produced by
applying the pitch delay T to a pitch codebook 301 to produce a pitch codevector.
The pitch codevector is then processed through a low-pass filter 302 whose
cut-off frequency is selected in relation to index j from the demultiplexer 317 to
produce the filtered pitch codevector $v_T$. The filtered pitch codevector $v_T$ is
then amplified by the pitch gain b by an amplifier 326 to produce the scaled pitch
codevector $bv_T$.

In this illustrative implementation, the periodicity factor α is then computed in
voicing factor generator 304 by:

$$\alpha = 0.125\,(1 + r_v)$$

which corresponds to a value of 0 for purely unvoiced signals and 0.25 for purely
voiced signals.

The enhanced signal $c_f$ is therefore computed by filtering the scaled
innovative codevector $gc_k$ through the innovation filter 305 (F(z)).

The enhanced excitation signal u' is computed by the adder 320 as:

$$u' = c_f + b v_T$$
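As a rough illustration of the steps above, the following Python sketch computes the voicing factor, the periodicity factor and the enhanced excitation for one subframe. The function names, the use of NumPy, the zero-padding at the subframe edges and the neutral value returned for an all-zero subframe are assumptions made for this example; they are not details taken from the decoder 300.

```python
import numpy as np

def voicing_factor(b, v_T, g, c_k):
    """r_v = (E_v - E_c) / (E_v + E_c) for one subframe."""
    e_v = b * b * float(np.dot(v_T, v_T))  # energy of the scaled pitch codevector
    e_c = g * g * float(np.dot(c_k, c_k))  # energy of the scaled innovative codevector
    total = e_v + e_c
    if total == 0.0:
        return 0.0  # assumption: neutral value for a silent subframe
    return (e_v - e_c) / total

def pitch_enhance(x, alpha):
    """Apply F(z) = -alpha*z + 1 - alpha*z^-1, i.e.
    y(n) = x(n) - alpha * (x(n+1) + x(n-1)).
    Samples outside the subframe are taken as zero (assumption)."""
    padded = np.concatenate(([0.0], np.asarray(x, dtype=float), [0.0]))
    return padded[1:-1] - alpha * (padded[2:] + padded[:-2])

def enhanced_excitation(b, v_T, g, c_k):
    """Return (u', alpha) with u' = c_f + b*v_T."""
    r_v = voicing_factor(b, v_T, g, c_k)
    alpha = 0.125 * (1.0 + r_v)  # 0 for purely unvoiced, 0.25 for purely voiced
    c_f = pitch_enhance(g * np.asarray(c_k, dtype=float), alpha)
    return c_f + b * np.asarray(v_T, dtype=float), alpha
```

With a purely unvoiced subframe (b = 0) the factor α is 0 and the innovation passes through unchanged; with a strongly voiced subframe α approaches 0.25 and the low-frequency content of the innovation is attenuated most.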



It should be noted that this process is not performed at the encoder 200.
Thus, it is essential to update the content of the pitch codebook 301 using the
past value of the excitation signal u without enhancement stored in memory 303
to keep synchronism between the encoder 200 and decoder 300. Therefore, the
excitation signal u is used to update the memory 303 of the pitch codebook 301
and the enhanced excitation signal u' is used at the input of the LP synthesis filter
306.

The synthesized signal s' is computed by filtering the enhanced excitation
signal u' through the LP synthesis filter 306 which has the form 1/A(z), where
A(z) is the quantized, interpolated LP filter in the current subframe. As can be
seen in Figure 3, the quantized, interpolated LP coefficients A(z) on line 325 from
the demultiplexer 317 are supplied to the LP synthesis filter 306 to adjust the
parameters of the LP synthesis filter 306 accordingly. The deemphasis filter 307
is the inverse of the preemphasis filter 203 of Figure 2. The transfer function of
the deemphasis filter 307 is given by

$$D(z) = 1/(1 - \mu z^{-1})$$

where μ is a preemphasis factor with a value located between 0 and 1 (a typical
value is μ = 0.7). A higher-order filter could also be used.
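The first-order deemphasis above corresponds to the recursion y(n) = s'(n) + μ·y(n-1). The sketch below shows one way to write it in Python; the function name and the explicit `mem` argument (the last output sample of the previous block) are assumptions made for this example.

```python
def deemphasis(s, mu=0.7, mem=0.0):
    """Filter s through D(z) = 1 / (1 - mu*z^-1).

    Returns the filtered samples and the updated filter memory, so that
    consecutive blocks can be processed without a discontinuity."""
    out = []
    prev = mem  # y(-1), carried over from the previous block
    for sample in s:
        prev = sample + mu * prev
        out.append(prev)
    return out, prev
```

Feeding a unit impulse through the filter gives the decaying sequence 1, μ, μ², and so on, which is the inverse of the preemphasis response 1 - μz⁻¹.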

The vector s' is filtered through the deemphasis filter D(z) 307 to obtain
the vector $s_d$, which is processed through the high-pass filter 308 to remove the
unwanted frequencies below 50 Hz and further obtain $s_h$.

The oversampler 309 conducts the inverse process of the downsampler
201 of Figure 2. In this illustrative embodiment, over-sampling converts the 12.8
kHz sampling rate back to the original 16 kHz sampling rate, using techniques
well known to those of ordinary skill in the art. The oversampled synthesis signal
is denoted ŝ. Signal ŝ is also referred to as the synthesized wideband
intermediate signal.

The oversampled synthesis signal ŝ does not contain the higher
frequency components which were lost during the downsampling process
(module 201 of Figure 2) at the encoder 200. This gives a low-pass perception to
the synthesized speech signal. To restore the full band of the original signal, a
high frequency generation procedure is performed in module 310 and requires
input from voicing factor generator 304 (Figure 3).

The resulting band-pass filtered noise sequence z from the high frequency
generation module 310 is added by the adder 321 to the oversampled
synthesized speech signal ŝ to obtain the final reconstructed output speech
signal $s_{out}$ on the output 323. An example of a high frequency regeneration
process is described in International PCT patent application published under No.
WO 00/25305 on May 4, 2000.

The bit allocation of the AMR-WB codec at 12.65 kbit/s is given in Table 1.

Table 1. Bit allocation in the 12.65-kbit/s mode

Parameter             Bits / Frame
--------------------  ---------------------------
LP Parameters         46
Pitch Delay           30 = 9 + 6 + 9 + 6
Pitch Filtering       4 = 1 + 1 + 1 + 1
Gains                 28 = 7 + 7 + 7 + 7
Algebraic Codebook    144 = 36 + 36 + 36 + 36
Mode Bit              1
--------------------  ---------------------------
Total                 253 bits = 12.65 kbit/s
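The total in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes the 20 ms frame duration used by the codec; the dictionary keys simply repeat the table rows.

```python
# Bits per 20 ms frame, copied from Table 1 (per-subframe values summed).
BITS_PER_FRAME = {
    "LP parameters": 46,
    "Pitch delay": 9 + 6 + 9 + 6,
    "Pitch filtering": 1 + 1 + 1 + 1,
    "Gains": 7 + 7 + 7 + 7,
    "Algebraic codebook": 36 + 36 + 36 + 36,
    "Mode bit": 1,
}
FRAME_MS = 20  # frame duration in milliseconds

total_bits = sum(BITS_PER_FRAME.values())
rate_kbit_s = total_bits / FRAME_MS  # bits per millisecond equals kbit/s
```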



Robust frame erasure concealment

The erasure of frames has a major effect on the synthesized speech
quality in digital speech communication systems, especially when operating in
wireless environments and packet-switched networks. In wireless cellular
systems, the energy of the received signal can exhibit frequent severe fades
resulting in high bit error rates, and this becomes more evident at the cell
boundaries. In this case the channel decoder fails to correct the errors in the
received frame and, as a consequence, the error detector usually used after the
channel decoder will declare the frame as erased. In voice over packet network
applications, such as Voice over Internet Protocol (VoIP), the speech signal is
packetized, where usually a 20 ms frame is placed in each packet. In
packet-switched communications, a packet can be dropped at a router if the number
of packets becomes very large, or the packet can arrive at the receiver after a
long delay, in which case it should be declared as lost if its delay is more than
the length of a jitter buffer at the receiver side. In these systems, the codec is
typically subjected to frame erasure rates of 3 to 5%.

The problem of frame erasure (FER) processing is basically twofold.
First, when an erased frame indicator arrives, the missing frame must be
generated by using the information sent in the previous frame and by estimating
the signal evolution in the missing frame. The success of the estimation depends
not only on the concealment strategy, but also on the place in the speech signal
where the erasure happens. Secondly, a smooth transition must be assured
when normal operation recovers, i.e. when the first good frame arrives after a
block of erased frames (one or more). This is not a trivial task as the true
synthesis and the estimated synthesis can evolve differently. When the first good
frame arrives, the decoder is hence desynchronized from the encoder. The main
reason is that low bit rate encoders rely on pitch prediction, and during erased
frames, the memory of the pitch predictor is no longer the same as the one at the
encoder. The problem is amplified when many consecutive frames are erased. As
for the concealment, the difficulty of the normal processing recovery depends on
the type of speech signal where the erasure occurred.

The negative effect of frame erasures can be significantly reduced by
adapting the concealment and the recovery of normal processing (further
recovery) to the type of the speech signal where the erasure occurs. For this
purpose, it is necessary to classify each speech frame. This classification can be
done at the encoder and transmitted. Alternatively, it can be estimated at the
decoder.

For the best concealment and recovery, there are a few critical
characteristics of the speech signal that must be carefully controlled. These
critical characteristics are the signal energy or the amplitude, the amount of
periodicity, the spectral envelope and the pitch period. In case of a voiced speech
recovery, further improvement can be achieved by a phase control. With a slight
increase in the bit rate, a few supplementary parameters can be quantized and
transmitted for better control. If no additional bandwidth is available, the
parameters can be estimated at the decoder. With these parameters controlled,
the frame erasure concealment and recovery can be significantly improved,
especially by improving the convergence of the decoded signal to the actual
signal at the encoder and alleviating the effect of mismatch between the encoder
and decoder when normal processing recovers.

In the present illustrative embodiment of the present invention, methods
for efficient frame erasure concealment, and methods for extracting and
transmitting parameters that will improve the performance and convergence at
the decoder in the frames following an erased frame are disclosed. These
parameters include two or more of the following: frame classification, energy,
voicing information, and phase information. Further, methods for extracting such
parameters at the decoder, if transmission of extra bits is not possible, are
disclosed. Finally, methods for improving the decoder convergence in good
frames following an erased frame are also disclosed.

The frame erasure concealment techniques according to the present
illustrative embodiment have been applied to the AMR-WB codec described
above. This codec will serve as an example framework for the implementation of
the FER concealment methods in the following description. As explained above,
the input speech signal 212 to the codec has a 16 kHz sampling frequency, but it
is downsampled to a 12.8 kHz sampling frequency before further processing. In
the present illustrative embodiment, FER processing is done on the
downsampled signal.

Figure 4 gives a simplified block diagram of the AMR-WB encoder 400. In
this simplified block diagram, the downsampler 201, high-pass filter 202 and
preemphasis filter 203 are grouped together in the preprocessing module 401.
Also, the closed-loop search module 207, the zero-input response calculator 208,
the impulse response calculator 209, the innovative excitation search module
210, and the memory update module 211 are grouped in a closed-loop pitch and
innovation codebook search module 402. This grouping is done to simplify the
introduction of the new modules related to the illustrative embodiment of the
present invention.

Figure 5 is an extension of the block diagram of Figure 4 where the
modules related to the illustrative embodiment of the present invention are added.
In these added modules 500 to 507, additional parameters are computed,
quantized, and transmitted with the aim to improve the FER concealment and the
convergence and recovery of the decoder after erased frames. In the present
illustrative embodiment, these parameters include signal classification, energy,
and phase information (the estimated position of the first glottal pulse in a frame).

In the next sections, computation and quantization of these additional
parameters will be given in detail and become more apparent with reference to
Figure 5. Among these parameters, signal classification will be treated in more
detail. In the subsequent sections, efficient FER concealment using these
additional parameters to improve the convergence will be explained.

Signal classification for FER concealment and recovery

The basic idea behind using a classification of the speech for a signal
reconstruction in the presence of erased frames is that the ideal
concealment strategy is different for quasi-stationary speech segments and for
speech segments with rapidly changing characteristics. While the best processing
of erased frames in non-stationary speech segments can be summarized as a
rapid convergence of speech-encoding parameters to the ambient noise
characteristics, in the case of a quasi-stationary signal, the speech-encoding
parameters do not vary dramatically and can be kept practically unchanged
during several adjacent erased frames before being damped. Also, the optimal
method for a signal recovery following an erased block of frames varies with the
classification of the speech signal.

The speech signal can be roughly classified as voiced, unvoiced and
pauses. Voiced speech contains an important amount of periodic components
and can be further divided in the following categories: voiced onsets, voiced
segments, voiced transitions and voiced offsets. A voiced onset is defined as a
beginning of a voiced speech segment after a pause or an unvoiced segment.
During voiced segments, the speech signal parameters (spectral envelope, pitch
period, ratio of periodic and non-periodic components, energy) vary slowly from
frame to frame. A voiced transition is characterized by rapid variations of a voiced
speech, such as a transition between vowels. Voiced offsets are characterized by
a gradual decrease of energy and voicing at the end of voiced segments.

The unvoiced parts of the signal are characterized by missing the periodic
component and can be further divided into unstable frames, where the energy
and the spectrum change rapidly, and stable frames where these characteristics
remain relatively stable. Remaining frames are classified as silence. Silence
frames comprise all frames without active speech, i.e. also noise-only frames if a
background noise is present.

Not all of the above mentioned classes need a separate processing.
Hence, for the purposes of error concealment techniques, some of the signal
classes are grouped together.

Classification at the encoder

When there is available bandwidth in the bitstream to include the
classification information, the classification can be done at the encoder. This has
several advantages. The most important is that there is often a look-ahead in
speech encoders. The look-ahead makes it possible to estimate the evolution of
the signal in the following frame, and consequently the classification can be done
by taking into account the future signal behavior. Generally, the longer the
look-ahead, the better the classification can be. A further advantage is a
complexity reduction, as most of the signal processing necessary for frame
erasure concealment is needed anyway for speech encoding. Finally, there is also
the advantage of working with the original signal instead of the synthesized signal.

The frame classification is done with the consideration of the concealment
and recovery strategy in mind. In other words, any frame is classified in such a
way that the concealment can be optimal if the following frame is missing, or that
the recovery can be optimal if the previous frame was lost. Some of the classes
used for the FER processing need not be transmitted, as they can be deduced
without ambiguity at the decoder. In the present illustrative embodiment, five (5)
distinct classes are used, and defined as follows:

• UNVOICED class comprises all unvoiced speech frames and all
frames without active speech. A voiced offset frame can be also classified as
UNVOICED if its end tends to be unvoiced and the concealment designed for
unvoiced frames can be used for the following frame in case it is lost.

• UNVOICED TRANSITION class comprises unvoiced frames with a
possible voiced onset at the end. The onset is however still too short or not
built well enough to use the concealment designed for voiced frames. The
UNVOICED TRANSITION class can follow only a frame classified as
UNVOICED or UNVOICED TRANSITION.

• VOICED TRANSITION class comprises voiced frames with relatively
weak voiced characteristics. Those are typically voiced frames with rapidly
changing characteristics (transitions between vowels) or voiced offsets lasting
the whole frame. The VOICED TRANSITION class can follow only a frame
classified as VOICED TRANSITION, VOICED or ONSET.

• VOICED class comprises voiced frames with stable characteristics.
This class can follow only a frame classified as VOICED TRANSITION,
VOICED or ONSET.

• ONSET class comprises all voiced frames with stable characteristics
following a frame classified as UNVOICED or UNVOICED TRANSITION.
Frames classified as ONSET correspond to voiced onset frames where the
onset is already sufficiently well built for the use of the concealment designed
for lost voiced frames. The concealment techniques used for a frame erasure
following the ONSET class are the same as following the VOICED class. The
difference is in the recovery strategy. If an ONSET class frame is lost (i.e. a
VOICED good frame arrives after an erasure, but the last good frame before
the erasure was UNVOICED), a special technique can be used to artificially
reconstruct the lost onset. This scenario can be seen in Figure 6. The artificial
onset reconstruction techniques will be described in more detail in the
following description. On the other hand, if an ONSET good frame arrives after
an erasure and the last good frame before the erasure was UNVOICED, this
special processing is not needed, as the onset has not been lost (has not
been in the lost frame).
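The predecessor constraints stated in the class definitions above, and the condition under which a lost onset has to be reconstructed, can be sketched as a small lookup in Python. The names `ALLOWED_PREVIOUS`, `is_valid_transition` and `needs_artificial_onset` are hypothetical and chosen for this example only; the UNVOICED class has no stated restriction and is therefore left out of the table.

```python
# For each class, the set of classes that the preceding frame may have,
# as stated in the class definitions.
ALLOWED_PREVIOUS = {
    "UNVOICED TRANSITION": {"UNVOICED", "UNVOICED TRANSITION"},
    "VOICED TRANSITION": {"VOICED TRANSITION", "VOICED", "ONSET"},
    "VOICED": {"VOICED TRANSITION", "VOICED", "ONSET"},
    "ONSET": {"UNVOICED", "UNVOICED TRANSITION"},
}

def is_valid_transition(previous, current):
    """True if `current` may follow `previous` in consecutive frames."""
    allowed = ALLOWED_PREVIOUS.get(current)
    return allowed is None or previous in allowed

def needs_artificial_onset(last_good_before_erasure, first_good_after_erasure):
    """True when the ONSET frame itself was lost: a VOICED frame arrives after
    the erasure although the last good frame before it was UNVOICED."""
    return (last_good_before_erasure == "UNVOICED"
            and first_good_after_erasure == "VOICED")
```

Note that `is_valid_transition` applies to consecutive frames only; across an erasure the intermediate classes are unknown, which is exactly the situation `needs_artificial_onset` detects.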

The classification state diagram is outlined in Figure 7. If the available
bandwidth is sufficient, the classification is done in the encoder and transmitted
using 2 bits. As can be seen from Figure 7, the UNVOICED TRANSITION class
and the VOICED TRANSITION class can be grouped together as they can be
unambiguously differentiated at the decoder (UNVOICED TRANSITION can
follow only UNVOICED or UNVOICED TRANSITION frames, VOICED
TRANSITION can follow only ONSET, VOICED or VOICED TRANSITION
frames). The following parameters are used for the classification: a normalized
correlation $r_x$, a spectral tilt measure $e_t$, a signal to noise ratio snr, a pitch
stability counter pc, a relative frame energy of the signal at the end of the current
frame $E_s$ and a zero-crossing counter zc. As can be seen in the following detailed
analysis, the computation of these parameters uses the available look-ahead as
much as possible to take into account the behavior of the speech signal also in
the following frame.

The normalized correlation $r_x$ is computed as part of the open-loop pitch
search module 206 of Figure 5. This module 206 usually outputs the open-loop
pitch estimate every 10 ms (twice per frame). Here, it is also used to output the
normalized correlation measures. These normalized correlations are computed
on the current weighted speech signal $s_w(n)$ and the past weighted speech signal
at the open-loop pitch delay. In order to reduce the complexity, the weighted
speech signal $s_w(n)$ is downsampled by a factor of 2 prior to the open-loop pitch
analysis down to the sampling frequency of 6400 Hz [3GPP TS 26.190, "AMR
Wideband Speech Codec: Transcoding Functions," 3GPP Technical
Specification]. The average correlation $\bar{r}_x$ is defined as

$$\bar{r}_x = 0.5\,(r_x(1) + r_x(2)) \qquad (1)$$

where $r_x(1)$, $r_x(2)$ are respectively the normalized correlation of the second half of
the current frame and of the look-ahead. In this illustrative embodiment, a
look-ahead of 13 ms is used unlike the AMR-WB standard that uses 5 ms. The
normalized correlation $r_x(k)$ is computed as follows:

$$r_x(k) = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} \qquad (2)$$

where

$$r_{xy} = \sum_{i=0}^{L_k-1} x(t_k+i)\, x(t_k+i-p_k)$$

$$r_{xx} = \sum_{i=0}^{L_k-1} x^2(t_k+i)$$

$$r_{yy} = \sum_{i=0}^{L_k-1} x^2(t_k+i-p_k)$$

The correlations $r_x(k)$ are computed using the weighted speech signal
$s_w(n)$. The instants $t_k$ are related to the current frame beginning and are equal to
64 and 128 samples respectively at the sampling rate or frequency of 6.4 kHz (10
and 20 ms). The values $p_k = T_{OL}$ are the selected open-loop pitch estimates. The
length of the autocorrelation computation $L_k$ is dependent on the pitch period.
The values of $L_k$ are summarized below (for the sampling rate of 6.4 kHz):

$L_k$ = 40 samples for $p_k \le 31$ samples
$L_k$ = 62 samples for $p_k \le 61$ samples
$L_k$ = 115 samples for $p_k > 61$ samples

These lengths assure that the correlated vector length comprises at least
one pitch period, which helps for a robust open-loop pitch detection. For long pitch
periods ($p_1 > 61$ samples), $r_x(1)$ and $r_x(2)$ are identical, i.e. only one correlation is
computed since the correlated vectors are long enough so that the analysis on
the look-ahead is no longer necessary.
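The correlation measure described above can be sketched as follows. This is only an illustration under stated assumptions: the signal is held in one NumPy array with enough past samples, `origin` is a hypothetical argument giving the index of the current frame beginning in that array, and a zero denominator returns 0.

```python
import numpy as np

def correlation_length(p_k):
    """Length L_k of the correlated vectors for pitch lag p_k (6.4 kHz samples)."""
    if p_k <= 31:
        return 40
    if p_k <= 61:
        return 62
    return 115

def normalized_correlation(x, origin, t_k, p_k):
    """r_x(k) = r_xy / sqrt(r_xx * r_yy), evaluated at instant t_k."""
    x = np.asarray(x, dtype=float)
    length = correlation_length(p_k)
    start = origin + t_k
    cur = x[start:start + length]                # x(t_k + i)
    past = x[start - p_k:start - p_k + length]   # x(t_k + i - p_k)
    denom = np.sqrt(np.dot(cur, cur) * np.dot(past, past))
    return float(np.dot(cur, past) / denom) if denom > 0.0 else 0.0

def average_correlation(x, origin, p_1, p_2):
    """Average of r_x(1) (t_1 = 64) and r_x(2) (t_2 = 128)."""
    r_1 = normalized_correlation(x, origin, 64, p_1)
    # For long pitch periods only one correlation is computed and reused.
    r_2 = r_1 if p_1 > 61 else normalized_correlation(x, origin, 128, p_2)
    return 0.5 * (r_1 + r_2)
```

For a perfectly periodic signal whose period equals the pitch lag, both correlations are close to 1; for a lag of half the period they are close to -1.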

The spectral tilt parameter $e_t$ contains the information about the frequency
distribution of energy. In the present illustrative embodiment, the spectral tilt is
estimated as a ratio between the energy concentrated in low frequencies and the
energy concentrated in high frequencies. However, it can also be estimated in
different ways such as a ratio between the two first autocorrelation coefficients of
the speech signal.

The discrete Fourier Transform is used to perform the spectral analysis in
the spectral analysis and spectrum energy estimation module 500 of Figure 5.
The frequency analysis and the tilt computation are done twice per frame. A
256-point Fast Fourier Transform (FFT) is used with a 50 percent overlap. The
analysis windows are placed so that all the look-ahead is exploited. In this
illustrative embodiment, the beginning of the first window is placed 24 samples
after the beginning of the current frame. The second window is placed 128
samples further. Different windows can be used to weight the input signal for the
frequency analysis. A square root of a Hanning window (which is equivalent to a
sine window) has been used in the present illustrative embodiment. This window
is particularly well suited for overlap-add methods. Therefore, this particular
spectral analysis can be used in an optional noise suppression algorithm based
on spectral subtraction and overlap-add analysis/synthesis.

The energy in high frequencies and in low frequencies is computed in
module 500 of Figure 5 following the perceptual critical bands. In the present
illustrative embodiment each critical band is considered up to the following
upper frequency limit [J. D. Johnston, "Transform Coding of Audio Signals Using
Perceptual Noise Criteria," IEEE Jour. on Selected Areas in Communications,
vol. 6, no. 2, pp. 314-323]:

Critical bands = {100.0, 200.0, 300.0, 400.0, 510.0, 630.0, 770.0, 920.0,
1080.0, 1270.0, 1480.0, 1720.0, 2000.0, 2320.0, 2700.0, 3150.0, 3700.0, 4400.0,
5300.0, 6350.0} Hz.

The energy in higher frequencies is computed in module 500 as the
average of the energies of the last two critical bands:

$$\bar{E}_h = 0.5\,(e(18) + e(19)) \qquad (3)$$

where the critical band energies e(i) are computed as a sum of the bin energies
within the critical band, averaged by the number of the bins.

The energy in lower frequencies is computed as the average of the
energies in the first 10 critical bands. The middle critical bands have been
excluded from the computation to improve the discrimination between frames with
high energy concentration in low frequencies (generally voiced) and with high
energy concentration in high frequencies (generally unvoiced). In between, the
energy content is not characteristic for any of the classes and would increase the
decision confusion.
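A minimal sketch of the critical band energy computation is given below. It assumes a 256-point FFT at 12.8 kHz (50 Hz bin spacing), takes a vector of bin energies indexed from DC, and assigns each bin to the band whose upper limit is the first one at or above the bin frequency. The function names and this bin-to-band assignment are assumptions made for this example.

```python
import numpy as np

# Upper band limits in Hz, as listed in the text.
BAND_LIMITS_HZ = [100.0, 200.0, 300.0, 400.0, 510.0, 630.0, 770.0, 920.0,
                  1080.0, 1270.0, 1480.0, 1720.0, 2000.0, 2320.0, 2700.0,
                  3150.0, 3700.0, 4400.0, 5300.0, 6350.0]
BIN_HZ = 12800.0 / 256  # 50 Hz

def critical_band_energies(bin_energy):
    """e(i): mean of the bin energies falling inside critical band i.
    bin_energy[k] is the energy of FFT bin k (k = 0 is DC, not used)."""
    bin_energy = np.asarray(bin_energy, dtype=float)
    energies = []
    lower = 0.0
    for upper in BAND_LIMITS_HZ:
        first = int(lower // BIN_HZ) + 1  # first bin strictly above the lower limit
        last = int(upper // BIN_HZ)       # last bin at or below the upper limit
        energies.append(bin_energy[first:last + 1].mean())
        lower = upper
    return np.array(energies)

def low_high_energies(e):
    """Average of the first 10 bands and of the last 2 bands."""
    e_low = float(np.mean(e[:10]))
    e_high = 0.5 * float(e[18] + e[19])
    return e_low, e_high
```

With this assignment the first 10 critical bands cover bins 1 to 25, which agrees with the 25 low-frequency bins used in the bin-wise computation.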

In module 500, the energy in low frequencies is computed differently for
long pitch periods and short pitch periods. For voiced female speech segments,
the harmonic structure of the spectrum can be exploited to increase the
voiced-unvoiced discrimination. Thus for short pitch periods, $\bar{E}_l$ is computed
bin-wise and only frequency bins sufficiently close to the speech harmonics are
taken into account in the summation, i.e.

$$\bar{E}_l = \frac{1}{cnt} \sum_{i=0}^{24} e_b(i) \qquad (4)$$

where $e_b(i)$ are the bin energies in the first 25 frequency bins (the DC component
is not considered). Note that these 25 bins correspond to the first 10 critical
bands. In the above summation, only terms related to the bins closer to the
nearest harmonics than a certain frequency threshold are non-zero. The counter
cnt equals the number of those non-zero terms. The threshold for a bin to be
included in the sum has been fixed to 50 Hz, i.e. only bins closer than 50 Hz to
the nearest harmonics are taken into account. Hence, if the structure is harmonic
in low frequencies, only high energy terms will be included in the sum. On the
other hand, if the structure is not harmonic, the selection of the terms will be
random and the sum will be smaller. Thus even unvoiced sounds with high
energy content in low frequencies can be detected. This processing cannot be
done for longer pitch periods, as the frequency resolution is not sufficient. The
threshold pitch value is 128 samples corresponding to 100 Hz. It means that for
pitch periods longer than 128 samples and also for a priori unvoiced sounds (i.e.
when $\bar{r}_x + r_e < 0.6$), the low frequency energy estimation is done per critical band
and is computed as

$$\bar{E}_l = \frac{1}{10} \sum_{i=0}^{9} e(i) \qquad (5)$$
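The choice between the bin-wise and the per-band low frequency energy can be sketched as below. The sketch assumes pitch periods expressed in samples at 12.8 kHz, a 50 Hz bin spacing, and a strict "closer than 50 Hz" test; the function names and the value returned when no bin qualifies are assumptions made for this example.

```python
def harmonic_low_energy(bin_energy, pitch, fs=12800.0, bin_hz=50.0, tol_hz=50.0):
    """Bin-wise low frequency energy: average of the bins (1..25, DC excluded)
    that lie closer than tol_hz to the nearest pitch harmonic."""
    f0 = fs / pitch
    total, cnt = 0.0, 0
    for k in range(1, 26):
        f = k * bin_hz
        harmonic = max(1, int(f / f0 + 0.5)) * f0  # nearest harmonic of f0
        if abs(f - harmonic) < tol_hz:
            total += bin_energy[k]
            cnt += 1
    return total / cnt if cnt else 0.0  # assumption: 0 when no bin qualifies

def low_frequency_energy(bin_energy, band_energy, pitch, r_x, r_e):
    """Per critical band for long pitch periods or a priori unvoiced frames,
    bin-wise around the harmonics otherwise."""
    if pitch > 128 or (r_x + r_e) < 0.6:
        return sum(band_energy[:10]) / 10.0
    return harmonic_low_energy(bin_energy, pitch)
```

For a pitch period of 64 samples (200 Hz), only the bins at 200, 400, ..., 1200 Hz contribute, so a spectrum with strong harmonics yields a much larger value than the plain band average would.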



The value $r_e$, calculated in a noise estimation and normalized correlation
correction module 501, is a correction added to the normalized correlation in
presence of background noise for the following reason. In the presence of
background noise, the average normalized correlation decreases. However, for
the purpose of signal classification, this decrease should not affect the
voiced-unvoiced decision. It has been found that the dependence between this
decrease $r_e$ and the total background noise energy in dB is approximately
exponential and can be expressed using the following relationship

$$r_e = 2.4492 \cdot 10^{-4} \cdot e^{0.1596\, N_{dB}} - 0.022$$

where $N_{dB}$ stands for

$$N_{dB} = 10 \log_{10}\left( \frac{1}{20} \sum_{i=0}^{19} n(i) \right) - g_{dB}$$

Here, n(i) are the noise energy estimates for each critical band normalized in the
same way as e(i) and $g_{dB}$ is the maximum noise suppression level in dB allowed
for the noise reduction routine. The value $r_e$ is not allowed to be negative. It
should be noted that when a good noise reduction algorithm is used and $g_{dB}$ is
sufficiently high, $r_e$ is practically equal to zero. It is only relevant when the noise
reduction is disabled or if the background noise level is significantly higher than
the maximum allowed reduction. The influence of $r_e$ can be tuned by multiplying
this term with a constant.
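The correlation correction can be sketched in a few lines of Python. The constants are those of the relationship given above for $r_e$; the function name, the argument names and the small floor that protects the logarithm against an all-zero noise estimate are assumptions made for this example.

```python
import math

def correlation_correction(noise_band_energy, g_db):
    """r_e as an exponential function of the residual noise level in dB.

    noise_band_energy: the 20 per-band noise energy estimates n(i)
    g_db: maximum noise suppression level in dB"""
    mean_noise = max(sum(noise_band_energy) / 20.0, 1e-12)  # assumed floor
    n_db = 10.0 * math.log10(mean_noise) - g_db
    r_e = 2.4492e-4 * math.exp(0.1596 * n_db) - 0.022
    return max(r_e, 0.0)  # r_e is not allowed to be negative
```

With a low noise level or a strong noise reduction the result clips to zero, so the correction only matters in the noisy, weakly suppressed case.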



Finally, the resulting lower and higher frequency energies are obtained by
subtracting an estimated noise energy from the values $\bar{E}_h$ and $\bar{E}_l$ calculated
above. That is

$$E_h = \bar{E}_h - f_c \cdot N_h \qquad (6)$$

$$E_l = \bar{E}_l - f_c \cdot N_l \qquad (7)$$

where $N_h$ and $N_l$ are the averaged noise energies in the last two (2) critical bands
and first ten (10) critical bands, respectively, computed using equations similar to
Equations (3) and (5), and $f_c$ is a correction factor tuned so that these measures
remain close to constant with varying the background noise level. In this
illustrative embodiment, the value of $f_c$ has been fixed to 3.

The spectral tilt $e_t$ is calculated in the spectral tilt estimation module 503
using the relation:

$$e_t = \frac{E_l}{E_h} \qquad (8)$$

and it is averaged in the dB domain for the two (2) frequency analyses performed
per frame:

$$\bar{e}_t = 10 \log_{10}\left(e_t(0) \cdot e_t(1)\right)$$
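The noise correction and the tilt computation can be illustrated as follows. The text does not say what happens when the noise-corrected energy becomes zero or negative, so the small positive floor used here is an assumption, as are the function names.

```python
import math

def noise_corrected(e_bar, noise, f_c=3.0, floor=1e-6):
    """E = E_bar - f_c * N, kept strictly positive (the floor is an assumption)."""
    return max(e_bar - f_c * noise, floor)

def spectral_tilt_db(e_low, e_high):
    """Tilt over the two spectral analyses of a frame, in dB.

    e_low, e_high: two-element sequences holding the noise-corrected low and
    high frequency energies of the first and second analysis."""
    tilt_0 = e_low[0] / e_high[0]
    tilt_1 = e_low[1] / e_high[1]
    return 10.0 * math.log10(tilt_0 * tilt_1)
```

A frame whose low frequency energy is 100 times its high frequency energy in both analyses gives 40 dB with this expression.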

The signal to noise ratio (SNR) measure exploits the fact that for a general
waveform matching encoder, the SNR is much higher for voiced sounds. The snr
parameter estimation must be done at the end of the encoder subframe loop and
is computed in the SNR computation module 504 using the relation:

$$snr = \frac{E_{sw}}{E_e} \qquad (9)$$

where $E_{sw}$ is the energy of the weighted speech signal $s_w(n)$ of the current frame
from the perceptual weighting filter 205 and $E_e$ is the energy of the error between
this weighted speech signal and the weighted synthesis signal of the current
frame from the perceptual weighting filter 205'.

The pitch stability counter pc assesses the variation of the pitch period. It
is computed within the signal classification module 505 in response to the
open-loop pitch estimates as follows:

$$pc = |p_1 - p_0| + |p_2 - p_1| \qquad (10)$$

The values $p_0$, $p_1$, $p_2$ correspond to the open-loop pitch estimates calculated by
the open-loop pitch search module 206 from the first half of the current frame, the
second half of the current frame and the look-ahead, respectively.
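Equation (10) translates directly into code. The function name below is chosen for this example only.

```python
def pitch_stability(p0, p1, p2):
    """pc = |p1 - p0| + |p2 - p1| for the three open-loop pitch estimates
    (first half-frame, second half-frame, look-ahead)."""
    return abs(p1 - p0) + abs(p2 - p1)
```

A stable voiced segment yields a small value, while a pitch estimate that jumps between half-frames, as is typical for unvoiced frames, yields a large one.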

The relative frame energy $E_s$ is computed by module 500 as a difference
between the current frame energy in dB and its long-term average

$$E_s = E_f - E_{lt}$$

where the frame energy $E_f$ is obtained as a summation of the critical band
energies, averaged for both spectral analyses performed each frame:

$$E_f = 10 \log_{10}\left(0.5\,(E_f(0) + E_f(1))\right)$$

$$E_f(j) = \sum_{i=0}^{19} e(i)$$

The long-term averaged energy is updated on active speech frames using the
following relation:

$$E_{lt} = 0.99\,E_{lt} + 0.01\,E_f$$
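The relative energy and the long-term average update can be sketched as a small stateful helper. The class name, the initial value of the long-term average and the choice to compute the relative energy before updating the average are assumptions made for this example.

```python
import math

class RelativeEnergyTracker:
    """Tracks the long-term average energy E_lt (in dB) and returns E_s."""

    def __init__(self, initial_lt_db):
        self.e_lt = initial_lt_db  # assumed starting value of the average

    def update(self, bands_0, bands_1, active_speech):
        """bands_0, bands_1: critical band energies of the two analyses."""
        e_f = 10.0 * math.log10(0.5 * (sum(bands_0) + sum(bands_1)))
        e_s = e_f - self.e_lt
        if active_speech:  # the average is updated on active speech frames only
            self.e_lt = 0.99 * self.e_lt + 0.01 * e_f
        return e_s
```

For example, a 20 dB frame measured against a 30 dB long-term average gives a relative energy of -10 dB and moves the average to 29.9 dB.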

The last parameter is the zero-crossing parameter zc computed on one
frame of the speech signal by the zero-crossing computation module 508. The
frame starts in the middle of the current frame and uses two (2) subframes of the
look-ahead. In this illustrative embodiment, the zero-crossing counter zc counts
the number of times the signal sign changes from positive to negative during that
interval.
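A possible way to count the positive-to-negative sign changes is shown below. Treating a sample equal to zero as non-negative is an assumption of this sketch, and the caller is expected to pass the one-frame interval described above.

```python
def zero_crossings(signal):
    """Number of times the signal goes from non-negative to negative."""
    count = 0
    for prev, cur in zip(signal, signal[1:]):
        if prev >= 0 and cur < 0:
            count += 1
    return count
```

Only falling crossings are counted, so an alternating sequence of five samples starting with a positive value contains two of them.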

To make the classification more robust, the classification parameters are
considered together, forming a function of merit f_m. For that purpose, the
classification parameters are first scaled between 0 and 1 so that each
parameter's value typical for an unvoiced signal translates into 0 and each
parameter's value typical for a voiced signal translates into 1. A linear function
is used between them. For a parameter p_x, the scaled version is obtained using:

$$p^s = k_p \, p_x + c_p$$

and clipped between 0 and 1. The function coefficients k_p and c_p have been
found experimentally for each of the parameters so that the signal distortion due
to the concealment and recovery techniques used in the presence of FERs is
minimal. The values used in this illustrative implementation are summarized in
Table 2:




Table 2. Signal Classification Parameters and the coefficients of their respective scaling functions

Parameter   Meaning                    k_p        c_p
r_x         Normalized Correlation     2.857      -1.286
e_t         Spectral Tilt              0.04167    0
snr         Signal to Noise Ratio      0.1111     -0.3333
pc          Pitch Stability Counter    -0.07143   1.857
E_s         Relative Frame Energy      0.05       0.45
zc          Zero Crossing Counter      -0.04      2.4



The merit function has been defined as:

$$f_m = \frac{1}{7}\left(2\,r_x^s + e_t^s + snr^s + pc^s + E_s^s + zc^s\right)$$

where the superscript s indicates the scaled version of the parameters.
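The scaling and the merit function can be sketched as follows, using the coefficients of Table 2. The dictionary keys and function names are hypothetical. The 1/7 normalisation is an assumption made here because the weights (2 for the normalized correlation, 1 for each other parameter) sum to 7, which keeps f_m between 0 and 1.

```python
# (k_p, c_p) pairs taken from Table 2; the keys are illustrative names.
SCALING = {
    "rx": (2.857, -1.286),    # normalized correlation
    "et": (0.04167, 0.0),     # spectral tilt
    "snr": (0.1111, -0.3333), # signal to noise ratio
    "pc": (-0.07143, 1.857),  # pitch stability counter
    "Es": (0.05, 0.45),       # relative frame energy
    "zc": (-0.04, 2.4),       # zero crossing counter
}

def scale(name, value):
    """Linear scaling k_p * p_x + c_p, clipped between 0 and 1."""
    k, c = SCALING[name]
    return min(1.0, max(0.0, k * value + c))

def merit(params):
    """Function of merit f_m; the normalized correlation is weighted twice.

    Assumption: the sum is normalized by 7 (the sum of the weights).
    """
    s = {name: scale(name, value) for name, value in params.items()}
    total = 2.0 * s["rx"] + s["et"] + s["snr"] + s["pc"] + s["Es"] + s["zc"]
    return total / 7.0
```

For example, a frame with high correlation, stable pitch and few zero crossings saturates every scaled parameter at 1 and yields f_m = 1, while a noise-like frame drives f_m towards 0.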

The classification is then done using the merit function f_m and following
the rules summarized in Table 3:

Table 3. Signal Classification Rules at the Encoder

Previous Frame Class                 Rule                    Current Frame Class
ONSET, VOICED, VOICED TRANSITION     f_m ≥ 0.66              VOICED
                                     0.66 > f_m ≥ 0.49       VOICED TRANSITION
                                     f_m < 0.49              UNVOICED
UNVOICED TRANSITION, UNVOICED        f_m > 0.63              ONSET
                                     0.63 ≥ f_m > 0.585      UNVOICED TRANSITION
                                     f_m ≤ 0.585             UNVOICED
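The rules of Table 3 amount to a small state-dependent threshold test. The sketch below is illustrative only: the function name and the string labels are hypothetical, and the handling of the threshold boundaries follows the reading of the table in which the voiced-side thresholds are inclusive at 0.66 and 0.49.

```python
VOICED_LIKE = {"ONSET", "VOICED", "VOICED TRANSITION"}
UNVOICED_LIKE = {"UNVOICED TRANSITION", "UNVOICED"}

def classify_encoder(previous_class, fm):
    """Map the previous frame class and the merit value f_m to the current class."""
    if previous_class in VOICED_LIKE:
        if fm >= 0.66:
            return "VOICED"
        if fm >= 0.49:
            return "VOICED TRANSITION"
        return "UNVOICED"
    if previous_class in UNVOICED_LIKE:
        if fm > 0.63:
            return "ONSET"
        if fm > 0.585:
            return "UNVOICED TRANSITION"
        return "UNVOICED"
    raise ValueError("unknown frame class: " + previous_class)
```

The two sets of thresholds differ, which gives the classifier some hysteresis: a higher merit is needed to leave an unvoiced state through UNVOICED TRANSITION than to remain in a voiced-type state.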







In the case of a source-controlled variable bit rate (VBR) encoder, a signal
classification is inherent to the codec operation. The codec operates at several bit
rates, and a rate selection module is used to determine the bit rate used for
encoding each speech frame based on the nature of the speech frame (e.g.
voiced, unvoiced, transient, background noise frames are each encoded with a
special encoding algorithm). The information about the coding mode and thus
about the speech class is already an implicit part of the bitstream and need not
be explicitly transmitted for FER processing. This class information can then be
used to overwrite the classification decision described above.

In the example application to the AMR-WB codec, the only source-controlled
rate selection is the voice activity detection (VAD). This VAD
flag equals 1 for active speech and 0 for silence. This parameter is useful for the
classification as it directly indicates that no further classification is needed if its
value is 0 (i.e. the frame is directly classified as UNVOICED). This parameter is
the output of the voice activity detection (VAD) module 402. Different VAD
algorithms exist in the literature and any algorithm can be used for the purpose of
the present invention. For instance, the VAD algorithm that is part of standard
G.722.2 can be used [ITU-T Recommendation G.722.2 "Wideband coding of
speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)",
Geneva, 2002]. Here, the VAD algorithm is based on the output of the spectral
analysis of module 500 (based on the signal-to-noise ratio per critical band). The
VAD used for the classification purpose differs from the one used for encoding
purpose with respect to the hangover. In speech encoders using comfort noise
generation (CNG) for segments without active speech (silence or noise-only), a
hangover is often added after speech spurts (CNG in the AMR-WB standard is an
example [3GPP TS 26.192, "AMR Wideband Speech Codec: Comfort Noise
Aspects," 3GPP Technical Specification]). During the hangover, the speech
encoder continues to be used and the system switches to the CNG only after the
hangover period is over. For the purpose of classification for FER concealment,
this high security is not needed. Consequently, the VAD flag for the classification
will equal 0 also during the hangover period.

In this illustrative embodiment, the classification is performed in module
505 based on the parameters described above; namely, normalized correlations
(or voicing information) r_x, spectral tilt e_t, snr, pitch stability counter pc, relative
frame energy E_s, zero crossing rate zc, and VAD flag.

Classification at the decoder 

If the application does not permit the transmission of the class information
(no extra bits can be transported), the classification can still be performed at the
decoder. As already noted, the main disadvantage here is that there is generally
no available look-ahead in speech decoders. Also, there is often a need to keep
the decoder complexity limited.

A simple classification can be done by estimating the voicing of the
synthesized signal. If we consider the case of a CELP type encoder, the voicing
estimate r_v computed as in Equation (1) can be used. That is:

$$r_v = \frac{E_v - E_c}{E_v + E_c}$$

where E_v is the energy of the scaled pitch codevector b v_T and E_c is the energy
of the scaled innovative codevector g c_k. Theoretically, for a purely voiced signal
r_v = 1 and for a purely unvoiced signal r_v = -1. The actual classification is done by
averaging the r_v values every 4 subframes. The resulting factor f_rv (the average of
the r_v values of every four subframes) is used as follows:

Table 4. Signal Classification Rules at the Decoder

Previous Frame Class                 Rule                    Current Frame Class
ONSET, VOICED, VOICED TRANSITION     f_rv > -0.1             VOICED
                                     -0.1 ≥ f_rv ≥ -0.5      VOICED TRANSITION
                                     f_rv < -0.5             UNVOICED
UNVOICED TRANSITION, UNVOICED        f_rv > -0.1             ONSET
                                     -0.1 ≥ f_rv ≥ -0.5      UNVOICED TRANSITION
                                     f_rv < -0.5             UNVOICED
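The decoder-side classification can be sketched as below. The function names and string labels are hypothetical; the per-subframe energies of the scaled pitch and innovative codevectors are assumed to be available from the CELP decoder.

```python
VOICED_LIKE = {"ONSET", "VOICED", "VOICED TRANSITION"}

def voicing_factor(e_v, e_c):
    """r_v = (E_v - E_c) / (E_v + E_c); +1 is purely voiced, -1 purely unvoiced."""
    return (e_v - e_c) / (e_v + e_c)

def classify_decoder(previous_class, rv_subframes):
    """Average r_v over the subframes of a frame and apply the rules of Table 4."""
    f_rv = sum(rv_subframes) / len(rv_subframes)
    voiced_before = previous_class in VOICED_LIKE
    if f_rv > -0.1:
        return "VOICED" if voiced_before else "ONSET"
    if f_rv >= -0.5:
        return "VOICED TRANSITION" if voiced_before else "UNVOICED TRANSITION"
    return "UNVOICED"
```

Unlike the encoder rules, both groups of previous classes share the same two thresholds, so only the label assigned to the upper and middle ranges depends on the previous frame.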



Similarly to the classification at the encoder, other parameters can be
used at the decoder to help the classification, such as the parameters of the LP
filter or the pitch stability.

In the case of a source-controlled variable bit rate coder, the information about
the coding mode is already a part of the bitstream. Hence, if for example a purely
unvoiced coding mode is used, the frame can be automatically classified as
UNVOICED. Similarly, if a purely voiced coding mode is used, the frame is
classified as VOICED.

Speech parameters for FER processing 

There are a few critical parameters that must be carefully controlled to avoid
annoying artifacts when FERs occur. If a few extra bits can be transmitted, then
these parameters can be estimated at the encoder, quantized, and transmitted.
Otherwise, some of them can be estimated at the decoder. These parameters
include signal classification, energy information, phase information, and voicing
information. The most important is a precise control of the speech energy. The
phase and the speech periodicity can be controlled too for further improving the
FER concealment and recovery.

The importance of the energy control manifests itself mainly when
normal operation recovers after an erased block of frames. As most speech
encoders make use of a prediction, the right energy cannot be properly estimated
at the decoder. In voiced speech segments, the incorrect energy can persist for
several consecutive frames, which is very annoying especially when this incorrect
energy increases.

Even if the energy control is most important for voiced speech because of
the long term prediction (pitch prediction), it is important also for unvoiced
speech. The reason here is the prediction of the innovation gain quantizer often
used in CELP type coders. The wrong energy during unvoiced segments can
cause an annoying high frequency fluctuation.

The phase control can be done in several ways, mainly depending on the
available bandwidth. In our implementation, a simple phase control is achieved
during lost voiced onsets by searching the approximate information about the
glottal pulse position.

Hence, apart from the signal classification information discussed in the
previous section, the most important information to send is the information about
the signal energy and the position of the first glottal pulse in a frame (phase
information). If enough bandwidth is available, voicing information can be sent,
too.

Energy information

The energy information can be estimated and sent either in the LP
residual domain or in the speech signal domain. Sending the information in the
residual domain has the disadvantage of not taking into account the influence of
the LP synthesis filter. This can be particularly tricky in the case of voiced
recovery after several lost voiced frames (when the FER happens during a voiced
speech segment). When a FER arrives after a voiced frame, the excitation of the
last good frame is typically used during the concealment with some attenuation
strategy. When a new LP synthesis filter arrives with the first good frame after the
erasure, there can be a mismatch between the excitation energy and the gain of
the LP synthesis filter. The new synthesis filter can produce a synthesis signal
with an energy highly different from the energy of the last synthesized erased
frame and also from the original signal energy. For this reason, the energy is
computed and quantized in the signal domain.

The energy E_q is computed and quantized in energy estimation and
quantization module 506. It has been found that 6 bits are sufficient to transmit
the energy. However, the number of bits can be reduced without a significant
effect if not enough bits are available. In this preferred embodiment, a 6 bit
uniform quantizer is used in the range of -15 dB to 83 dB with a step of 1.58 dB.
The quantization index is given by the integer part of:

$$i = \frac{10 \log_{10}(E + 0.001) + 15}{1.58} \qquad (15)$$



where E is the maximum of the signal energy for frames classified as VOICED or
ONSET, or the average energy per sample for other frames. For VOICED or
ONSET frames, the maximum of signal energy is computed pitch synchronously
at the end of the frame as follows:

$$E = \max_{i = L - t_E, \ldots, L-1} \bigl(s^2(i)\bigr) \qquad (16)$$

where L is the frame length and signal s(i) stands for the speech signal (or the
denoised speech signal if noise suppression is used). In this illustrative
embodiment s(i) stands for the input signal after downsampling to 12.8 kHz and
pre-processing. If the pitch delay is greater than 63 samples, t_E equals the
rounded closed-loop pitch lag of the last subframe. If the pitch delay is shorter than
64 samples, then t_E is set to twice the rounded closed-loop pitch lag of the last
subframe.

For other classes, E is the average energy per sample of the second half
of the current frame, i.e. t_E is set to L/2 and E is computed as:

$$E = \frac{1}{t_E} \sum_{i = L - t_E}^{L-1} s^2(i) \qquad (17)$$
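The energy estimation of Equations (16) and (17) and the quantization of Equation (15) can be sketched as follows. The function names are hypothetical. Clamping the index to the 6-bit range 0..63 is an assumption added here; the text only specifies taking the integer part.

```python
import math

def frame_energy(s, frame_class, last_subframe_lag):
    """Energy E used for concealment, computed at the end of the frame s."""
    L = len(s)
    if frame_class in ("VOICED", "ONSET"):
        # Pitch-synchronous window: one lag if it exceeds 63 samples, else two lags.
        t_e = last_subframe_lag if last_subframe_lag > 63 else 2 * last_subframe_lag
        return max(x * x for x in s[L - t_e:])       # Equation (16)
    t_e = L // 2                                     # second half of the frame
    return sum(x * x for x in s[L - t_e:]) / t_e     # Equation (17)

def quantize_energy(e):
    """6-bit index: integer part of Equation (15), clamped to 0..63 (assumption)."""
    index = int((10.0 * math.log10(e + 0.001) + 15.0) / 1.58)
    return min(63, max(0, index))
```

Taking the maximum over at least one pitch period captures the pulse peak of voiced speech, whereas an average is more representative for noise-like frames.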

Phase control information

The phase control is particularly important while recovering after a lost
segment of voiced speech for similar reasons as described in the previous
section. After a block of erased frames, the decoder memories become
desynchronized with the encoder memories. To resynchronize the decoder, some
phase information can be sent depending on the available bandwidth. In the
described illustrative implementation, a rough position of the first glottal pulse in
the frame is sent. This information is then used for the recovery after lost voiced
onsets as will be described later.

Let T_0 be the rounded closed-loop pitch lag for the first subframe. First
glottal pulse search and quantization module 507 searches the position of the first
glottal pulse τ among the T_0 first samples of the frame by looking for the sample
with the maximum amplitude. Best results are obtained when the position of the
first glottal pulse is measured on the low-pass filtered residual signal.

The position of the first glottal pulse is coded using 6 bits in the following
manner. The precision used to encode the position of the first glottal pulse
depends on the closed-loop pitch value for the first subframe T_0. This is possible
because this value is known both by the encoder and the decoder, and is not
subject to error propagation after one or several frame losses. When T_0 is less
than 64, the position of the first glottal pulse relative to the beginning of the frame
is encoded directly with a precision of one sample. When 64 ≤ T_0 < 128, the
position of the first glottal pulse relative to the beginning of the frame is encoded
with a precision of two samples by using a simple integer division, i.e. τ/2. When
T_0 ≥ 128, the position of the first glottal pulse relative to the beginning of the
frame is encoded with a precision of four samples by further dividing τ by 2. The
inverse procedure is done at the decoder. If T_0 < 64, the received quantized
position is used as is. If 64 ≤ T_0 < 128, the received quantized position is
multiplied by 2 and incremented by 1. If T_0 ≥ 128, the received quantized
position is multiplied by 4 and incremented by 2 (incrementing by 2 results in a
uniformly distributed quantization error).
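The position coding described above can be sketched as a pair of functions. The names are hypothetical; `tau` is the pulse position in samples from the start of the frame and `t0` is the rounded closed-loop pitch lag of the first subframe, both assumed to be integers.

```python
def quantize_pulse_position(tau, t0):
    """Encoder side: resolution of 1, 2 or 4 samples depending on t0."""
    if t0 < 64:
        return tau
    if t0 < 128:
        return tau // 2
    return tau // 4

def dequantize_pulse_position(q, t0):
    """Decoder side: inverse mapping with the offsets given in the text."""
    if t0 < 64:
        return q
    if t0 < 128:
        return 2 * q + 1
    return 4 * q + 2
```

Because the pulse is searched among the first T_0 samples, coarsening the resolution as T_0 grows keeps the index within 6 bits for the pitch lags considered here.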

According to another embodiment of the invention where the shape of the
first glottal pulse is encoded, the position of the first glottal pulse is determined by
a correlation analysis between the residual signal and the possible pulse shapes,
signs (positive or negative) and positions. The pulse shape can be taken from a
codebook of pulse shapes known at both the encoder and the decoder, this
method being known as vector quantization by those of ordinary skill in the art.
The shape, sign and amplitude of the first glottal pulse are then encoded and
transmitted to the decoder.

Periodicity information 

In case there is enough bandwidth, periodicity information, or voicing
information, can be computed and transmitted, and used at the decoder to
improve the frame erasure concealment. The voicing information is estimated
based on the normalized correlation. It can be encoded quite precisely with 4 bits;
however, 3 or even 2 bits would suffice if necessary. The voicing information is
necessary in general only for frames with some periodic components, and better
voicing resolution is needed for highly voiced frames. The normalized correlation
is given in Equation (2) and it is used as an indicator to the voicing information. It
is quantized in first glottal pulse search and quantization module 507. In this
illustrative embodiment, a piece-wise linear quantizer has been used to encode
the voicing information as follows:



$$i = \frac{r_x(2) - 0.65}{0.03} + 0.5 \quad \text{for } r_x(2) < 0.92 \qquad (18)$$

$$i = 9 + \frac{r_x(2) - 0.92}{0.01} + 0.5 \quad \text{for } r_x(2) \ge 0.92 \qquad (19)$$



Again, the integer part of i is encoded and transmitted. The correlation
r_x(2) has the same meaning as in Equation (1). In Equation (18) the voicing is
linearly quantized between 0.65 and 0.89 with the step of 0.03. In Equation (19)
the voicing is linearly quantized between 0.92 and 0.98 with the step of 0.01.
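The piece-wise linear quantizer of Equations (18) and (19) can be sketched as below. The function names are hypothetical. Clamping the index to the 4-bit range 0..15 and the form of the decoder-side inverse are assumptions added for illustration; the text only gives the forward mapping and the two quantization ranges.

```python
def quantize_voicing(rx):
    """4-bit index from the normalized correlation r_x(2)."""
    if rx < 0.92:
        index = int((rx - 0.65) / 0.03 + 0.5)         # Equation (18)
    else:
        index = int(9 + (rx - 0.92) / 0.01 + 0.5)     # Equation (19)
    return min(15, max(0, index))                     # clamp: assumption

def dequantize_voicing(index):
    """Assumed inverse: centre value of the quantization cell."""
    if index < 9:
        return 0.65 + 0.03 * index
    return 0.92 + 0.01 * (index - 9)
```

Indices 0 to 8 cover 0.65 to 0.89 in steps of 0.03 and indices 9 to 15 cover 0.92 to 0.98 in steps of 0.01, which gives the finer resolution for highly voiced frames mentioned in the text.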

If a larger quantization range is needed, the following linear quantization can
be used:

$$i = \frac{\bar{r}_x - 0.4}{0.04} + 0.5 \qquad (20)$$

This equation quantizes the voicing in the range of 0.4 to 1 with the step of 0.04.
The correlation r̄_x is defined in Equation (2a).

Equations (18) and (19), or Equation (20), are then used in the
decoder to compute r_x(2) or r̄_x. Let us call this quantized normalized correlation
r_q. If the voicing cannot be transmitted, it can be estimated using the voicing
factor from Equation (2a) by mapping it in the range from 0 to 1:

$$r_q = 0.5\,(f + 1) \qquad (21)$$

Processing of erased frames 

The FER concealment techniques in this illustrative embodiment are
demonstrated on ACELP type encoders. They can, however, be easily applied to
any speech codec where the synthesis signal is generated by filtering an
excitation signal through an LP synthesis filter. The concealment strategy can be
summarized as a convergence of the signal energy and the spectral envelope to
the estimated parameters of the background noise. The periodicity of the signal
converges to zero. The speed of the convergence depends on the
class of the last good received frame and the number of consecutive
erased frames, and is controlled by an attenuation factor α. The factor α is further
dependent on the stability of the LP filter for UNVOICED frames. In general, the
convergence is slow if the last good received frame is in a stable segment and is
rapid if the frame is in a transition segment. The values of α are summarized in
Table 5.

Table 5. Values of the FER concealment attenuation factor α

Last Good Received Frame    Number of successive erased frames    α
ARTIFICIAL ONSET                                                  0.6
ONSET, VOICED               ≤ 3                                   1.0
                            > 3                                   0.4
VOICED TRANSITION                                                 0.4
UNVOICED TRANSITION                                               0.8
UNVOICED                    = 1                                   0.6 θ + 0.4
                            > 1                                   0.4



A stability factor θ is computed based on a distance measure between the
adjacent LP filters. Here, the factor θ is related to the ISF (Immittance Spectral
Frequencies) distance measure and it is bounded by 0 ≤ θ ≤ 1, with larger values of
θ corresponding to more stable signals. This results in decreasing energy and
spectral envelope fluctuations when an isolated frame erasure occurs inside a
stable unvoiced segment.
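The selection of the attenuation factor from Table 5 can be sketched as a lookup. The function name and string labels are hypothetical, and the first row threshold is read as "up to three successive erased frames"; `theta` is the stability factor, assumed to be computed elsewhere.

```python
def attenuation_factor(last_good_class, erased_count, theta=0.0):
    """Attenuation factor alpha for the current erased frame, following Table 5.

    erased_count is the number of successive erased frames so far (1 for the first).
    """
    if last_good_class == "ARTIFICIAL ONSET":
        return 0.6
    if last_good_class in ("ONSET", "VOICED"):
        return 1.0 if erased_count <= 3 else 0.4
    if last_good_class == "VOICED TRANSITION":
        return 0.4
    if last_good_class == "UNVOICED TRANSITION":
        return 0.8
    if last_good_class == "UNVOICED":
        return 0.6 * theta + 0.4 if erased_count == 1 else 0.4
    raise ValueError("unknown frame class: " + last_good_class)
```

For an isolated erasure inside a stable unvoiced segment (theta close to 1) the factor approaches 1, so the energy is barely attenuated, which is the behaviour described above.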

The signal class remains unchanged during the processing of erased
frames, i.e. the class remains the same as in the last good received frame.

Construction of the periodic part of the excitation 

For a concealment of erased frames following a correctly received
UNVOICED frame, no periodic part of the excitation signal is generated. For a
concealment of erased frames following a correctly received frame other than
UNVOICED, the periodic part of the excitation signal is constructed by repeating
the last pitch period of the previous frame. In the case of the 1st erased frame
after a good frame, this pitch pulse is first low-pass filtered. The filter used is a
simple 3-tap linear phase FIR filter with filter coefficients equal to 0.18, 0.64 and
0.18. If voicing information is available, the filter can also be selected
dynamically with a cut-off frequency dependent on the voicing.

The pitch period T_c used to select the last pitch pulse, and hence used
during the concealment, is defined so that pitch multiples or submultiples can be
avoided, or reduced. The following logic is used in determining the pitch period
T_c:

if ((T_3 < 1.8 T_s) AND (T_3 > 0.6 T_s)) OR (T_cnt ≥ 30), then T_c = T_3, else T_c = T_s.

Here, T_3 is the rounded pitch period of the 4th subframe of the last good received
frame and T_s is the rounded pitch period of the 4th subframe of the last good
stable voiced frame with coherent pitch estimates. A stable voiced frame is
defined here as a VOICED frame preceded by a frame of voiced type (VOICED
TRANSITION, VOICED, ONSET). The coherence of pitch is verified in this
implementation by examining whether the closed-loop pitch estimates are
reasonably close, i.e. whether the ratios between the last subframe pitch, the 2nd
subframe pitch and the last subframe pitch of the previous frame are within the
interval (0.7, 1.4).

This determination of the pitch period T_c means that if the pitch at the end
of the last good frame and the pitch of the last stable frame are close to each
other, the pitch of the last good frame is used. Otherwise this pitch is considered
unreliable and the pitch of the last stable frame is used instead to avoid the
impact of wrong pitch estimates at voiced onsets. This logic, however, makes
sense only if the last stable segment is not too far in the past. Hence a counter
T_cnt is defined that limits the reach of the influence of the last stable segment. If
T_cnt is greater than or equal to 30, i.e. if there are at least 30 frames since the
last T_s update, the last good frame pitch is used systematically. T_cnt is reset to 0
every time a stable segment is detected and T_s is updated. The period T_c is then
maintained constant during the concealment for the whole erased block.
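The pitch selection logic above reduces to a single comparison. In this sketch the function and argument names are hypothetical: `t3` is the rounded pitch of the 4th subframe of the last good frame, `ts` that of the last stable voiced frame, and `t_cnt` the number of frames since `ts` was last updated.

```python
def concealment_pitch(t3, ts, t_cnt):
    """Choose the pitch period T_c used for the whole erased block."""
    close_to_stable = 0.6 * ts < t3 < 1.8 * ts
    if close_to_stable or t_cnt >= 30:
        return t3   # last good frame pitch is trusted
    return ts       # fall back to the last stable pitch
```

For example, with a stable pitch of 58 samples, a last-frame pitch of 120 would be rejected as a likely pitch multiple unless the stable estimate is at least 30 frames old.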

As the last pulse of the excitation of the previous frame is used for the
construction of the periodic part, its gain is approximately correct at the beginning
of the concealed frame and can be set to 1. The gain is then attenuated linearly
throughout the frame on a sample by sample basis to achieve the value of α at
the end of the frame.

The values of α correspond to Table 5 with the exception that they are
modified for erasures following VOICED and ONSET frames to take into
consideration the energy evolution of voiced segments. This evolution can be
extrapolated to some extent by using the pitch excitation gain values of each
subframe of the last good frame. In general, if these gains are greater than 1, the
signal energy is increasing; if they are lower than 1, the energy is decreasing. α is
thus multiplied by a correction factor f_b computed as follows:

$$f_b = \sqrt{0.1\,b(0) + 0.2\,b(1) + 0.3\,b(2) + 0.4\,b(3)} \qquad (23)$$

where b(0), b(1), b(2) and b(3) are the pitch gains of the four subframes of the
last correctly received frame. The value of f_b is clipped between 0.98 and 0.85
before being used to scale the periodic part of the excitation. In this way, strong
energy increases and decreases are avoided.
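The correction factor and the sample-by-sample gain ramp can be sketched as follows. The function names are hypothetical, and the exact ramp shape (gain exactly 1 on the first sample and exactly α on the last) is an assumption; the text only states that the attenuation is linear over the frame.

```python
import math

def correction_factor(pitch_gains):
    """f_b of Equation (23), clipped to the range [0.85, 0.98]."""
    b = pitch_gains
    f_b = math.sqrt(0.1 * b[0] + 0.2 * b[1] + 0.3 * b[2] + 0.4 * b[3])
    return min(0.98, max(0.85, f_b))

def adjusted_alpha(alpha, last_good_class, pitch_gains):
    """Apply f_b only for erasures following VOICED or ONSET frames."""
    if last_good_class in ("VOICED", "ONSET"):
        return alpha * correction_factor(pitch_gains)
    return alpha

def periodic_gain_ramp(alpha, frame_length):
    """Linear per-sample gain from 1 down to alpha (frame_length must exceed 1)."""
    step = (alpha - 1.0) / (frame_length - 1)
    return [1.0 + step * i for i in range(frame_length)]
```

The weights in Equation (23) favour the most recent subframes, so the extrapolated energy trend follows the end of the last good frame more closely than its beginning.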

For erased frames following a correctly received frame other than
UNVOICED, the excitation buffer is updated with this periodic part of the
excitation only. This update will be used to construct the pitch codebook
excitation in the next frame.

Construction of the random part of the excitation 

The innovation (non-periodic) part of the excitation signal is generated
randomly. It can be generated as a random noise or by using the CELP
innovation codebook with vector indexes generated randomly. In the present
illustrative embodiment, a simple random generator with approximately uniform
distribution has been used. Before adjusting the innovation gain, the randomly
generated innovation is scaled to some reference value, fixed here to the unitary
energy per sample.

At the beginning of an erased block, the innovation gain g_s is initialized by
using the innovation excitation gains of each subframe of the last good frame:

$$g_s = 0.1\,g(0) + 0.2\,g(1) + 0.3\,g(2) + 0.4\,g(3) \qquad (23a)$$

where g(0), g(1), g(2) and g(3) are the fixed codebook, or innovation, gains of the
four (4) subframes of the last correctly received frame. The attenuation strategy
of the random part of the excitation is somewhat different from the attenuation of
the pitch excitation. The reason is that the pitch excitation (and thus the excitation
periodicity) is converging to 0 while the random excitation is converging to the
comfort noise generation (CNG) excitation energy. The innovation gain
attenuation is done as:

$$g_s^1 = \alpha\, g_s^0 + (1 - \alpha)\, g_n \qquad (24)$$

where g_s^1 is the innovation gain at the beginning of the next frame, g_s^0 is the
innovation gain at the beginning of the current frame, g_n is the gain of the
excitation used during the comfort noise generation and α is as defined in Table
5. Similarly to the periodic excitation attenuation, the gain is thus attenuated
linearly throughout the frame on a sample by sample basis starting with g_s^0 and
going to the value of g_s^1 that would be achieved at the beginning of the next
frame.
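The initialization and frame-to-frame update of the innovation gain can be sketched as below. The function names are hypothetical; the comfort noise gain `g_n` is assumed to be supplied by the CNG estimation.

```python
def initial_innovation_gain(innovation_gains):
    """g_s at the start of an erased block, Equation (23a)."""
    g = innovation_gains
    return 0.1 * g[0] + 0.2 * g[1] + 0.3 * g[2] + 0.4 * g[3]

def next_innovation_gain(g_s0, alpha, g_n):
    """Gain at the start of the next frame, Equation (24)."""
    return alpha * g_s0 + (1.0 - alpha) * g_n
```

Applying the update repeatedly moves the gain geometrically towards g_n rather than towards zero, which is why a long erasure fades into comfort noise instead of silence.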

Finally, if the last good (correctly received or non erased) received frame
is different from UNVOICED, the innovation excitation is filtered through a linear
phase FIR high-pass filter with coefficients -0.0125, -0.109, 0.7813, -0.109,
-0.0125. To decrease the amount of noisy components during voiced segments,
these filter coefficients are multiplied by an adaptive factor equal to
(0.75 - 0.25 r_v), r_v being the voicing factor as defined in Equation (1). The random
part of the excitation is then added to the adaptive excitation to form the total
excitation signal.

If the last good frame is UNVOICED, only the innovation excitation is used
and it is further attenuated by a factor of 0.8. In this case, the past excitation
buffer is updated with the innovation excitation as no periodic part of the
excitation is available.

Spectral Envelope Concealment, Synthesis and updates

To synthesize the decoded speech, the LP filter parameters must be
obtained. The spectral envelope is gradually moved to the estimated envelope of
the ambient noise. Here the ISF representation of LP parameters is used:

$$l^1(j) = \alpha\, l^0(j) + (1 - \alpha)\, l^n(j), \quad j = 0, \ldots, p-1 \qquad (25)$$

In Equation (25), l^1(j) is the value of the j-th ISF of the current frame, l^0(j) is the
value of the j-th ISF of the previous frame, l^n(j) is the value of the j-th ISF of the
estimated comfort noise envelope and p is the order of the LP filter.
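Equation (25) is an element-wise interpolation between the previous ISF vector and the comfort noise ISF vector. A minimal sketch, with a hypothetical function name and ISF vectors represented as plain lists of equal length:

```python
def conceal_isf(previous_isf, noise_isf, alpha):
    """ISF vector of the current erased frame, Equation (25)."""
    return [alpha * prev + (1.0 - alpha) * noise
            for prev, noise in zip(previous_isf, noise_isf)]
```

With alpha equal to 1 the previous envelope is repeated unchanged, and smaller values pull the envelope more quickly towards the noise estimate.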

The synthesized speech is obtained by filtering the excitation signal
through the LP synthesis filter. The filter coefficients are computed from the ISF
representation and are interpolated for each subframe (four (4) times per frame)
as during normal encoder operation.

As the innovation gain quantizer and the ISF quantizer both use a prediction, their
memory will not be up to date after the normal operation is resumed. To reduce
this effect, the quantizers' memories are estimated and updated at the end of
each erased frame.

Recovery of the normal operation after erasure 

The problem of the recovery after an erased block of frames is basically
due to the strong prediction used in practically all modern speech encoders. In
particular, the CELP type speech coders achieve their high signal to noise ratio
for voiced speech due to the fact that they are using the past excitation signal to
encode the present frame excitation (long-term or pitch prediction). Also, most of
the quantizers (LP quantizers, gain quantizers) make use of a prediction.

Artificial onset construction

The most complicated situation related to the use of the long-term
prediction in CELP encoders is when a voiced onset is lost. The lost onset means
that the voiced speech onset happened somewhere during the erased block. In
this case, the last good received frame was unvoiced and thus no periodic
excitation is found in the excitation buffer. The first good frame after the erased
block is however voiced, the excitation buffer at the encoder is highly periodic and
the adaptive excitation has been encoded using this periodic past excitation. As
this periodic part of the excitation is completely missing at the decoder, it can take
up to several frames to recover from this loss.

If an ONSET frame is lost (i.e. a VOICED good frame arrives after an
erasure, but the last good frame before the erasure was UNVOICED as shown in
Figure 6), a special technique is used to artificially reconstruct the lost onset and
to trigger the voiced synthesis. At the beginning of the 1st good frame after a lost
onset, the periodic part of the excitation is constructed artificially as a low-pass
filtered periodic train of pulses separated by a pitch period. In the present
illustrative embodiment, the low-pass filter is a simple linear phase FIR filter with
the impulse response h_low = {-0.0125, 0.109, 0.7813, 0.109, -0.0125}. However,
the filter could also be selected dynamically with a cut-off frequency
corresponding to the voicing information if this information is available. The
innovative part of the excitation is constructed using normal CELP decoding. The
entries of the innovation codebook could also be chosen randomly (or the
innovation itself could be generated randomly), as the synchrony with the original
signal has been lost anyway.

In practice, the length of the artificial onset is limited so that at least one
entire pitch period is constructed by this method and the method is continued to
the end of the current subframe. After that, regular ACELP processing is
resumed. The pitch period considered is the rounded average of the decoded
pitch periods of all subframes where the artificial onset reconstruction is used.
The low-pass filtered impulse train is realized by placing the impulse responses of
the low-pass filter in the adaptive excitation buffer (previously initialized to zero).
The first impulse response will be centered at the quantized position τ_q
(transmitted within the bitstream) with respect to the frame beginning and the
remaining impulses will be placed with the distance of the averaged pitch up to
the end of the last subframe affected by the artificial onset construction. If the
available bandwidth is not sufficient to transmit the first glottal pulse position, the
first impulse response can be placed arbitrarily around the half of the pitch period
after the current frame beginning.

As an example, for the subframe length of 64 samples, let us consider that
the pitch periods in the first and the second subframe are p(0)=70.75 and p(1)=71.
Since this is larger than the subframe size of 64, the artificial onset will be
constructed during the first two subframes and the pitch period will be equal to
the pitch average of the two subframes rounded to the nearest integer, i.e. 71.
The last two subframes will be processed by the normal CELP decoder.
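The construction of the low-pass filtered impulse train can be sketched as follows. This is an illustration under stated assumptions, not the codec's procedure: the function name is hypothetical, the number of subframes covered is chosen as the smallest count whose total length is at least the averaged pitch of those subframes (one reading of "at least one entire pitch period"), and the energy scaling applied afterwards is omitted.

```python
H_LOW = [-0.0125, 0.109, 0.7813, 0.109, -0.0125]  # low-pass impulse response

def build_artificial_onset(tau_q, subframe_pitches, subframe_len=64):
    """Return (pitch, subframes_used, buffer) for the periodic part of the onset."""
    n = 1
    # Extend the onset until it spans at least one averaged pitch period (assumption).
    while n < len(subframe_pitches) and n * subframe_len < sum(subframe_pitches[:n]) / n:
        n += 1
    pitch = int(sum(subframe_pitches[:n]) / n + 0.5)  # rounded average pitch
    buffer = [0.0] * (n * subframe_len)               # adaptive excitation, zeroed
    half = len(H_LOW) // 2
    position = tau_q                                  # first pulse at quantized position
    while position < len(buffer):
        for k, h in enumerate(H_LOW):
            index = position - half + k
            if 0 <= index < len(buffer):
                buffer[index] += h
        position += pitch
    return pitch, n, buffer
```

With the pitch values of the example above, two subframes are used and the pulses are spaced 71 samples apart, starting at the transmitted first pulse position.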

The energy of the periodic part of the artificial onset excitation is then
scaled by the gain corresponding to the quantized and transmitted energy for
FER concealment (as defined in Equations 16 and 17) and divided by the gain of
the LP synthesis filter. The LP synthesis filter gain is computed as:

$$g_{LP} = \sqrt{\sum_{i=0}^{63} h^2(i)} \qquad (31)$$



where h(i) is the LP synthesis filter impulse response. Finally, the artificial onset
gain is reduced by multiplying the periodic part by 0.96. Alternatively, this value
could correspond to the voicing if there were bandwidth available to transmit
also the voicing information. Alternatively, without diverting from the essence of
this invention, the artificial onset can also be constructed in the past excitation
buffer before entering the decoder subframe loop. This would have the advantage
of avoiding the special processing to construct the periodic part of the artificial
onset, and the regular CELP decoding could be used instead.

The LP filter for the output speech synthesis is not interpolated in the case
of an artificial onset construction. Instead, the received LP parameters are used
for the synthesis of the whole frame.
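The LP synthesis filter gain of Equation (31) and the resulting onset scaling can be sketched as below. The function names are hypothetical. Two assumptions are made: the LP polynomial is taken as A(z) = 1 + a_1 z^-1 + ... + a_p z^-p with the synthesis filter 1/A(z), and the gain derived from the transmitted energy is passed in as `energy_gain`, since its derivation is not reproduced here.

```python
import math

def impulse_response(a, length=64):
    """First `length` samples of the impulse response of 1/A(z).

    Assumed convention: A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ...
    """
    h = []
    for n in range(length):
        value = 1.0 if n == 0 else 0.0
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                value -= coeff * h[n - k]
        h.append(value)
    return h

def lp_synthesis_gain(h):
    """g_LP of Equation (31): root of the energy of the first 64 samples of h."""
    return math.sqrt(sum(x * x for x in h[:64]))

def onset_gain(energy_gain, h):
    """Gain applied to the periodic part: energy gain / g_LP, reduced by 0.96."""
    return 0.96 * energy_gain / lp_synthesis_gain(h)
```

Dividing by g_LP compensates for the amplification of the synthesis filter, so that the synthesized onset, rather than the excitation, ends up near the transmitted energy.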

Energy control 

The most important task at the recovery after an erased block of frames is
to properly control the energy of the synthesized speech signal. The synthesis
energy control is needed because of the strong prediction usually used in modern
speech coders. The energy control is most important when a block of erased
frames happens during a voiced segment. When a frame erasure arrives after a
voiced frame, the excitation of the last good frame is typically used during the
concealment with some attenuation strategy. When a new LP filter arrives with
the first good frame after the erasure, there can be a mismatch between the
excitation energy and the gain of the new LP synthesis filter. The new synthesis
filter can produce a synthesis signal with an energy highly different from the
energy of the last synthesized erased frame and also from the original signal
energy.

The energy control during the first good frame after an erased frame can 
be summarized as follows. The synthesized signal is scaled so that, at the 
beginning of the first good frame, its energy is similar to the energy of the 
synthesized speech signal at the end of the last erased frame, and so that it 
converges to the transmitted energy towards the end of the frame while 
preventing too large an energy increase. 

The energy control is done in the synthesized speech signal domain. Even 
if the energy is controlled in the speech domain, the excitation signal must be 
scaled as it serves as long term prediction memory for the following frames. The 
synthesis is then redone to smooth the transitions. Let $g_0$ denote the gain used to 
scale the first sample in the current frame and $g_1$ the gain used at the end of the 
frame. The excitation signal is then scaled as follows: 

$$u_s(i) = g_{AGC}(i)\,u(i), \qquad i = 0, \ldots, L-1 \qquad (32)$$

where $u_s(i)$ is the scaled excitation, $u(i)$ is the excitation before the scaling, L is 
the frame length and $g_{AGC}(i)$ is the gain starting from $g_0$ and converging 
exponentially to $g_1$: 

$$g_{AGC}(i) = f_{AGC}\,g_{AGC}(i-1) + (1 - f_{AGC})\,g_1, \qquad i = 0, \ldots, L-1$$

with the initialization $g_{AGC}(-1) = g_0$, where $f_{AGC}$ is the attenuation factor set in 
this implementation to the value of 0.98. This value has been found 
experimentally as a compromise between having a smooth transition from the previous 
(erased) frame on one side, and scaling the last pitch period of the current frame 
as much as possible to the correct (transmitted) value on the other side. This is 
important because the transmitted energy value is estimated pitch synchronously 
at the end of the frame. The gains $g_0$ and $g_1$ are defined as: 

$$g_0 = \sqrt{E_{-1}/E_0} \qquad (33a)$$

$$g_1 = \sqrt{E_q/E_1} \qquad (33b)$$

where $E_{-1}$ is the energy computed at the end of the previous (erased) frame, $E_0$ 
is the energy at the beginning of the current (recovered) frame, $E_1$ is the energy 
at the end of the current frame and $E_q$ is the quantized transmitted energy 
information at the end of the current frame, computed at the encoder from 
Equations (16, 17). $E_{-1}$ and $E_1$ are computed similarly with the exception that 
they are computed on the synthesized speech signal s'. $E_{-1}$ is computed pitch 
synchronously using the concealment pitch period $T_c$ and $E_1$ uses the last 
subframe rounded pitch $T_3$. $E_0$ is computed similarly using the rounded pitch 
value $T_0$ of the first subframe, the equations (16, 17) being modified to: 

$$E = \max_{i=0,\ldots,t_E} s'^2(i)$$

for VOICED and ONSET frames. $t_E$ equals the rounded pitch lag, or twice that 
length if the pitch is shorter than 64 samples. For other frames, 

$$E = \frac{1}{t_E} \sum_{i=0}^{t_E} s'^2(i)$$

with $t_E$ equal to half of the frame length. The gains $g_0$ and $g_1$ are further 
limited to a maximum allowed value, to prevent an excessive energy increase. This value has 
been set to 1.2 in the present illustrative implementation. 
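The gain limiting and the sample-by-sample excitation scaling described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the energies are assumed to have been measured already, and the zero-energy guard is an added assumption that does not come from the text.

```python
import math

F_AGC = 0.98   # attenuation factor quoted in the text
G_MAX = 1.2    # maximum allowed gain quoted in the text


def limited_gain(e_target, e_measured, g_max=G_MAX):
    """sqrt(e_target / e_measured), limited to g_max.
    Returns 1.0 for a non-positive measured energy (assumed guard)."""
    if e_measured <= 0.0:
        return 1.0
    return min(math.sqrt(e_target / e_measured), g_max)


def agc_scale(u, g0, g1, f_agc=F_AGC):
    """Scale excitation u with a gain that starts from g0 and converges
    exponentially to g1: g(i) = f_agc * g(i-1) + (1 - f_agc) * g1,
    with g(-1) = g0."""
    g = g0
    scaled = []
    for sample in u:
        g = f_agc * g + (1.0 - f_agc) * g1
        scaled.append(g * sample)
    return scaled


# Usage: g0 from the energies at the frame boundary, g1 from the
# target energy and the energy at the end of the current frame.
g0 = limited_gain(e_target=1.0, e_measured=4.0)   # sqrt(E_-1 / E_0)
g1 = limited_gain(e_target=4.0, e_measured=1.0)   # sqrt(E_q / E_1), limited
scaled = agc_scale([1.0, 1.0, 1.0], g0, g1)
```

With a factor of 0.98 the gain covers most of the distance from the first value to the second within a frame of a few hundred samples, which matches the stated goal of scaling the last pitch period close to the transmitted value.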

Conducting frame erasure concealment and decoder recovery comprises, 
when a gain of a LP filter of a first non erased frame received following frame 
erasure is higher than a gain of a LP filter of a last frame erased during said 
frame erasure, adjusting the energy of an LP filter excitation signal produced in 
the decoder during the received first non erased frame to a gain of the LP filter of 
said received first non erased frame using the following relation: 

If $E_q$ cannot be transmitted, $E_q$ is set to $E_1$. If however the erasure 
happens during a voiced speech segment (i.e. the last good frame before the 
erasure and the first good frame after the erasure are classified as VOICED 
TRANSITION, VOICED or ONSET), further precautions must be taken because 
of the possible mismatch between the excitation signal energy and the LP filter 
gain, mentioned previously. A particularly dangerous situation arises when the 
gain of the LP filter of a first non erased frame received following frame erasure is 
higher than the gain of the LP filter of a last frame erased during that frame 
erasure. In that particular case, the energy of the LP filter excitation signal 
produced in the decoder during the received first non erased frame is adjusted to 
a gain of the LP filter of the received first non erased frame using the following 
relation: 

$$E_q = E_1 \frac{E_{LP0}}{E_{LP1}}$$

where $E_{LP0}$ is the energy of the LP filter impulse response of the last good 
frame before the erasure and $E_{LP1}$ is the energy of the LP filter impulse response 
of the first good frame after the erasure. In this implementation, the LP filters of the last 
subframes in a frame are used. Finally, the value of $E_q$ is limited to the value of 
$E_{-1}$ in this case (voiced segment erasure without $E_q$ information being 
transmitted). 
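The fallback used when the energy information is not transmitted can be sketched as follows. The sketch assumes that the adjustment relation has the form E_q = E_1 · E_LP0 / E_LP1, which is inferred from the symbol definitions given here rather than confirmed; the function and argument names are hypothetical.

```python
def fallback_target_energy(e_1, e_prev, e_lp0, e_lp1, voiced_segment):
    """Target energy E_q when no energy information is transmitted.

    e_1    : energy at the end of the current frame (E_1)
    e_prev : energy at the end of the previous, erased frame (E_-1)
    e_lp0  : impulse-response energy of the LP filter before the erasure
    e_lp1  : impulse-response energy of the LP filter after the erasure
    voiced_segment : True if the frames around the erasure are classified
                     as VOICED TRANSITION, VOICED or ONSET
    """
    e_q = e_1                      # default: E_q is set to E_1
    if voiced_segment:
        if e_lp1 > e_lp0:          # new LP filter has the higher gain
            e_q = e_1 * e_lp0 / e_lp1   # assumed form of the relation
        e_q = min(e_q, e_prev)     # E_q is limited to E_-1
    return e_q
```

Under this reading, a new LP filter with twice the impulse-response energy of the old one halves the target energy, which keeps the synthesis from jumping in level at the first good frame.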

The following exceptions, all related to transitions in the speech signal, further 
overwrite the computation of $g_0$. If artificial onset is used in the current frame, $g_0$ 
is set to 0.5 $g_1$, to make the onset energy increase gradually. 

In the case of a first good frame after an erasure classified as ONSET, the 
gain $g_0$ is prevented from being higher than $g_1$. This precaution is taken to prevent a 
positive gain adjustment at the beginning of the frame (which is probably still at 
least partially unvoiced) from amplifying the voiced onset (at the end of the 
frame). 

Finally, during a transition from voiced to unvoiced (i.e. the last good 
frame being classified as VOICED TRANSITION, VOICED or ONSET and the 
current frame being classified UNVOICED) or during a transition from a non-active 
speech period to an active speech period (last good received frame being 
encoded as comfort noise and current frame being encoded as active speech), 
$g_0$ is set to $g_1$. 
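The three exceptions above can be collected into one selection function, sketched below. The class labels follow the text; the function name, the argument names and the order in which the rules are tested are assumptions made for illustration.

```python
VOICED_CLASSES = {"VOICED TRANSITION", "VOICED", "ONSET"}


def override_g0(g0, g1, *, artificial_onset, last_good_class, current_class,
                last_good_was_comfort_noise=False, current_is_active=True):
    """Apply the transition-related exceptions to the initial gain g0."""
    if artificial_onset:
        # Let the onset energy increase gradually.
        return 0.5 * g1
    if current_class == "ONSET":
        # Do not amplify the start of the frame more than its end.
        return min(g0, g1)
    if last_good_class in VOICED_CLASSES and current_class == "UNVOICED":
        # Voiced-to-unvoiced transition: use a constant gain.
        return g1
    if last_good_was_comfort_noise and current_is_active:
        # Transition from comfort noise to active speech.
        return g1
    return g0
```

When none of the exceptions applies, the gain computed from the energies at the frame boundary is returned unchanged.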

In case of a voiced segment erasure, the wrong energy problem can 
manifest itself also in frames following the first good frame after the erasure. This 
can happen even if the first good frame's energy has been adjusted as described 
above. To attenuate this problem, the energy control can be continued up to the 
end of the voiced segment. 

Although the present invention has been described in the foregoing 
description in relation to an illustrative embodiment thereof, this illustrative 
embodiment can be modified at will, within the scope of the appended claims, 
without departing from the scope and spirit of the subject invention. 

WHAT IS CLAIMED IS: 

1. A method for improving concealment of frame erasure caused by 
frames of an encoded sound signal erased during transmission from an encoder 
to a decoder, and for accelerating recovery of the decoder after non erased 
frames of the encoded sound signal have been received, comprising: 

determining, in the encoder, concealment/recovery parameters; 

transmitting to the decoder the concealment/recovery parameters 
determined in the encoder; and 

in the decoder, conducting erasure frame concealment and decoder 
recovery in response to the received concealment/recovery parameters. 



2. A method as defined in claim 1, further comprising quantizing, in the 
encoder, the concealment/recovery parameters prior to transmitting said 
concealment/recovery parameters to the decoder. 

3. A method as defined in claim 1, comprising determining, in the encoder, 
concealment/recovery parameters selected from the group consisting of: a signal 
classification parameter, an energy information parameter and a phase 
information parameter. 

4. A method as defined in claim 3, wherein determination of the phase 
information parameter comprises searching the position of a first glottal pulse in 
every frame of the encoded sound signal. 



5. A method as defined in claim 4, wherein determination of the phase 
information parameter further comprises encoding, in the encoder, the shape, 
sign and amplitude of the first glottal pulse and transmitting the encoded shape, 
sign and amplitude from the encoder to the decoder. 

6. A method as defined in claim 4, wherein searching the position of the 
first glottal pulse comprises: 

measuring the first glottal pulse as a sample of maximum amplitude within 
a pitch period; and 

quantizing the position of the sample of maximum amplitude within the 
pitch period. 

7. A method as defined in claim 1, wherein: 

the sound signal is a speech signal; and 

determination, in the encoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset. 

8. A method as defined in claim 7, wherein classifying the successive 
frames comprises classifying as unvoiced every frame which is an unvoiced 
frame, every frame without active speech, and every voiced offset frame having 
an end tending to be unvoiced. 

9. A method as defined in claim 7, wherein classifying the successive 
frames comprises classifying as unvoiced transition every unvoiced frame having 
an end with a possible voiced onset which is too short or not built well enough to 
be processed as a voiced frame. 

10. A method as defined in claim 7, wherein classifying the successive 
frames comprises classifying as voiced transition every voiced frame with 
relatively weak voiced characteristics, including voiced frames with rapidly 
changing characteristics and voiced offsets lasting the whole frame, wherein a 
frame classified as voiced transition follows only frames classified as voiced 
transition, voiced or onset. 

11. A method as defined in claim 7, wherein classifying the successive 
frames comprises classifying as voiced every voiced frame with stable 
characteristics, wherein a frame classified as voiced follows only frames 
classified as voiced transition, voiced or onset. 

12. A method as defined in claim 7, wherein classifying the successive 
frames comprises classifying as onset every voiced frame with stable 
characteristics following a frame classified as unvoiced or unvoiced transition. 

13. A method as defined in claim 7, comprising determining the 
classification of the successive frames of the encoded sound signal on the basis 
of at least a part of the following parameters: a normalized correlation parameter, 
a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability 
parameter, a relative frame energy parameter, and a zero crossing parameter. 

14. A method as defined in claim 13, wherein determining the 
classification of the successive frames comprises: 

computing a figure of merit on the basis of the normalized correlation 
parameter, spectral tilt parameter, signal-to-noise ratio parameter, pitch stability 
parameter, relative frame energy parameter, and zero crossing parameter; and 

comparing the figure of merit to thresholds to determine the classification. 

15. A method as defined in claim 13, comprising calculating the 
normalized correlation parameter on the basis of a current weighted version of 
the speech signal and a past weighted version of said speech signal. 

16. A method as defined in claim 13, comprising estimating the spectral tilt 
parameter as a ratio between an energy concentrated in low frequencies and an 
energy concentrated in high frequencies. 

17. A method as defined in claim 13, comprising estimating the signal-to-noise 
ratio parameter as a ratio between an energy of a weighted version of the 
speech signal of a current frame and an energy of an error between said 
weighted version of the speech signal of the current frame and a weighted 
version of a synthesized speech signal of said current frame. 

18. A method as defined in claim 13, comprising computing the pitch 
stability parameter in response to open-loop pitch estimates for a first half of a 
current frame, a second half of the current frame and a look-ahead. 

19. A method as defined in claim 13, comprising computing the relative 
frame energy parameter as a difference between an energy of a current frame 
and a long-term average of an energy of active speech frames. 

20. A method as defined in claim 13, comprising determining the zero-crossing 
parameter as a number of times a sign of the speech signal changes 
from a first polarity to a second polarity. 

21. A method as defined in claim 13, comprising computing at least one of 
the normalized correlation parameter, spectral tilt parameter, signal-to-noise ratio 
parameter, pitch stability parameter, relative frame energy parameter, and zero 
crossing parameter using an available look-ahead to take into consideration the 
behavior of the speech signal in the following frame. 

22. A method as defined in claim 13, further comprising determining the 
classification of the successive frames of the encoded sound signal also on the 
basis of a voice activity detection flag. 

23. A method as defined in claim 3, wherein: 

the sound signal is a speech signal; 

determination, in the encoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset; and 

determining concealment/recovery parameters comprises calculating the 
energy information parameter in relation to a maximum of a signal energy for 
frames classified as voiced or onset, and calculating the energy information 
parameter in relation to an average energy per sample for other frames. 

24. A method as defined in claim 1, wherein determining, in the encoder, 
concealment/recovery parameters comprises computing a voicing information 
parameter. 

25. A method as defined in claim 24, wherein: 

the sound signal is a speech signal; 

determination, in the encoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal; 

said method comprises determining the classification of the successive 
frames of the encoded sound signal on the basis of a normalized correlation 
parameter; and 

computing the voicing information parameter comprises estimating said 
voicing information parameter on the basis of the normalized correlation. 

26. A method as defined in claim 1, wherein conducting frame erasure 
concealment and decoder recovery comprises: 

following receiving a non erased unvoiced frame after frame erasure, 
generating no periodic part of a LP filter excitation signal; 

following receiving, after frame erasure, of a non erased frame other than 
unvoiced, constructing a periodic part of the LP filter excitation signal by 
repeating a last pitch period of a previous frame. 

27. A method as defined in claim 26, wherein constructing the periodic 
part of the LP filter excitation signal comprises filtering the repeated last pitch 
period of the previous frame through a low-pass filter. 

28. A method as defined in claim 27, wherein: 

determining concealment/recovery parameters comprises computing a 
voicing information parameter; 

the low-pass filter has a cut-off frequency; and 

constructing the periodic part of the excitation signal comprises 
dynamically adjusting the cut-off frequency in relation to the voicing information 
parameter. 

29. A method as defined in claim 1, wherein conducting frame erasure 
concealment and decoder recovery comprises randomly generating a non-periodic, 
innovation part of a LP filter excitation signal. 

30. A method as defined in claim 29, wherein randomly generating the 
non-periodic, innovation part of the LP filter excitation signal comprises 
generating a random noise. 

31. A method as defined in claim 29, wherein randomly generating the 
non-periodic, innovation part of the LP filter excitation signal comprises randomly 
generating vector indexes of an innovation codebook. 

32. A method as defined in claim 29, wherein: 

the sound signal is a speech signal; 

determination of concealment/recovery parameters comprises classifying 
successive frames of the encoded sound signal as unvoiced, unvoiced transition, 
voiced transition, voiced, or onset; and 

randomly generating the non-periodic, innovation part of the LP filter 
excitation signal further comprises: 

• if the last correctly received frame is different from unvoiced, 
filtering the innovation part of the excitation signal through a high pass 
filter; and 

• if the last correctly received frame is unvoiced, using only the 
innovation part of the excitation signal. 

33. A method as defined in claim 1, wherein: 

the sound signal is a speech signal; 

determination, in the encoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset; 

conducting frame erasure concealment and decoder recovery comprises, 
when an onset frame is lost which is indicated by the presence of a voiced frame 
following frame erasure and an unvoiced frame before frame erasure, artificially 
reconstructing the lost onset by constructing a periodic part of an excitation signal 
as a low-pass filtered periodic train of pulses separated by a pitch period. 

34. A method as defined in claim 33, wherein conducting frame erasure 
concealment and decoder recovery further comprises constructing an innovation 
part of the excitation signal by means of normal decoding. 

35. A method as defined in claim 34, wherein constructing an innovation 
part of the excitation signal comprises randomly choosing entries of an innovation 
codebook. 

36. A method as defined in claim 33, wherein artificially reconstructing the 
lost onset comprises limiting a length of the artificially reconstructed onset so that 
at least one entire pitch period is constructed by the onset artificial reconstruction, 
said reconstruction being continued until the end of a current subframe. 

37. A method as defined in claim 36, wherein conducting frame erasure 
concealment and decoder recovery further comprises, after artificial 
reconstruction of the lost onset, resuming a regular CELP processing wherein the 
pitch period is a rounded average of decoded pitch periods of all subframes 
where the artificial onset reconstruction is used. 

38. A method as defined in claim 3, wherein conducting frame erasure 
concealment and decoder recovery comprises: 

controlling an energy of a synthesized sound signal produced by the 
decoder, controlling energy of the synthesized sound signal comprising scaling 
the synthesized sound signal to render an energy of said synthesized sound 
signal at the beginning of a first non erased frame received following frame 
erasure similar to an energy of said synthesized signal at the end of a last frame 
erased during said frame erasure; and 

converging the energy of the synthesized sound signal in the received first 
non erased frame to an energy corresponding to the received energy information 
parameter toward the end of said received first non erased frame while limiting an 
increase in energy. 

39. A method as defined in claim 3, wherein: 

the energy information parameter is not transmitted from the encoder to 
the decoder; and 

conducting frame erasure concealment and decoder recovery comprises, 
when a gain of a LP filter of a first non erased frame received following frame 
erasure is higher than a gain of a LP filter of a last frame erased during said 
frame erasure, adjusting the energy of an LP filter excitation signal produced in 
the decoder during the received first non erased frame to a gain of the LP filter of 
said received first non erased frame. 

40. A method as defined in claim 39, wherein: 

adjusting the energy of an LP filter excitation signal produced in the 
decoder during the received first non erased frame to a gain of the LP filter of 
said received first non erased frame comprises using the following relation: 

$$E_q = E_1 \frac{E_{LP0}}{E_{LP1}}$$

where $E_1$ is the energy at the end of the current frame, $E_{LP0}$ is the energy of an 
impulse response of the LP filter of the last non erased frame received before the 
frame erasure, and $E_{LP1}$ is the energy of the impulse response of the LP filter of 
the received first non erased frame following frame erasure. 

41. A method as defined in claim 38, wherein: 

the sound signal is a speech signal; 

determination, in the encoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset; and 

when the first non erased frame received after a frame erasure is 
classified as ONSET, conducting frame erasure concealment and decoder 
recovery comprises limiting to a given value a gain used for scaling the 
synthesized sound signal. 

42. A method as defined in claim 38, wherein: 

the sound signal is a speech signal; 

determination, in the encoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset; and 

said method comprising making a gain used for scaling the synthesized 
sound signal at the beginning of the first non erased frame received after frame 
erasure equal to a gain used at the end of said received first non erased frame: 

• during a transition from a voiced frame to an unvoiced frame, in 
the case of a last non erased frame received before frame erasure 
classified as voiced transition, voiced or onset and a first non erased 
frame received after frame erasure classified as unvoiced; and 

• during a transition from a non-active speech period to an active 
speech period, when the last non erased frame received before frame 
erasure is encoded as comfort noise and the first non erased frame 
received after frame erasure is encoded as active speech. 

43. A method for the concealment of frame erasure caused by frames 
erased during transmission of a sound signal encoded under the form of signal-encoding 
parameters from an encoder to a decoder, and for accelerating 
recovery of the decoder after non erased frames of the encoded sound signal 
have been received, comprising: 

determining, in the decoder, concealment/recovery parameters from the 
signal-encoding parameters; 

in the decoder, conducting erased frame concealment and decoder 
recovery in response to the determined concealment/recovery parameters. 

44. A method as defined in claim 43, comprising determining, in the 
decoder, concealment/recovery parameters selected from the group consisting 
of: a signal classification parameter, an energy information parameter and a 
phase information parameter. 

45. A method as defined in claim 43, wherein: 

the sound signal is a speech signal; and 

determination, in the decoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset. 

46. A method as defined in claim 43, wherein determining, in the decoder, 
concealment/recovery parameters comprises computing a voicing information 
parameter. 

47. A method as defined in claim 43, wherein conducting frame erasure 
concealment and decoder recovery comprises: 

following receiving a non erased unvoiced frame after frame erasure, 
generating no periodic part of a LP filter excitation signal; 

following receiving, after frame erasure, of a non erased frame other than 
unvoiced, constructing a periodic part of the LP filter excitation signal by 
repeating a last pitch period of a previous frame. 

48. A method as defined in claim 47, wherein constructing the periodic 
part of the excitation signal comprises filtering the repeated last pitch period of 
the previous frame through a low-pass filter. 

49. A method as defined in claim 48, wherein: 

determining, in the decoder, concealment/recovery parameters comprises 
computing a voicing information parameter; 

the low-pass filter has a cut-off frequency; and 

constructing the periodic part of the LP filter excitation signal comprises 
dynamically adjusting the cut-off frequency in relation to the voicing information 
parameter. 

50. A method as defined in claim 43, wherein conducting frame erasure 
concealment and decoder recovery comprises randomly generating a non-periodic, 
innovation part of a LP filter excitation signal. 

51. A method as defined in claim 50, wherein randomly generating the 
non-periodic, innovation part of the LP filter excitation signal comprises 
generating a random noise. 

52. A method as defined in claim 50, wherein randomly generating the 
non-periodic, innovation part of the LP filter excitation signal comprises randomly 
generating vector indexes of an innovation codebook. 

53. A method as defined in claim 50, wherein: 

the sound signal is a speech signal; 

determination, in the decoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset; and 

randomly generating the non-periodic, innovation part of the LP filter 
excitation signal further comprises: 

• if the last received non erased frame is different from unvoiced, 
filtering the innovation part of the LP filter excitation signal through a 
high pass filter; and 

• if the last received non erased frame is unvoiced, using only the 
innovation part of the LP filter excitation signal. 

54. A method as defined in claim 50, wherein: 

the sound signal is a speech signal; 

determination, in the decoder, of concealment/recovery parameters 
comprises classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset; 

conducting frame erasure concealment and decoder recovery comprises, 
when an onset frame is lost which is indicated by the presence of a voiced frame 
following frame erasure and an unvoiced frame before frame erasure, artificially 
reconstructing the lost onset by constructing a periodic part of an excitation signal 
as a low-pass filtered periodic train of pulses separated by a pitch period. 

55. A method as defined in claim 54, wherein conducting frame erasure 
concealment and decoder recovery further comprises constructing an innovation 
part of the LP filter excitation signal by means of normal decoding. 

56. A method as defined in claim 55, wherein constructing an innovation 
part of the LP filter excitation signal comprises randomly choosing entries of an 
innovation codebook. 

57. A method as defined in claim 54, wherein artificially reconstructing the 
lost onset comprises limiting a length of the artificially reconstructed onset so that 
at least one entire pitch period is constructed by the onset artificial reconstruction, 
said reconstruction being continued until the end of a current subframe. 

58. A method as defined in claim 57, wherein conducting frame erasure 
concealment and decoder recovery further comprises, after artificial 
reconstruction of the lost onset, resuming a regular CELP processing wherein the 
pitch period is a rounded average of decoded pitch periods of all subframes 
where the artificial onset reconstruction is used. 

59. A method as defined in claim 44, wherein: 

the energy information parameter is not transmitted from the encoder to 
the decoder; and 

conducting frame erasure concealment and decoder recovery comprises, 
when a gain of a LP filter of a first non erased frame received following frame 
erasure is higher than a gain of a LP filter of a last frame erased during said 
frame erasure, adjusting the energy of an LP filter excitation signal produced in 
the decoder during the received first non erased frame to a gain of the LP filter of 
said received first non erased frame using the following relation: 

$$E_q = E_1 \frac{E_{LP0}}{E_{LP1}}$$

where $E_1$ is the energy at the end of the current frame, $E_{LP0}$ is the energy of an 
impulse response of the LP filter of the last non erased frame received before the 
frame erasure, and $E_{LP1}$ is the energy of the impulse response of the LP filter of 
the received first non erased frame following frame erasure. 

60. A device for improving concealment of frame erasure caused by 
frames of an encoded sound signal erased during transmission from an encoder 
to a decoder, and for accelerating recovery of the decoder after non erased 
frames of the encoded sound signal have been received, comprising: 

means for determining, in the encoder, concealment/recovery parameters; 

means for transmitting to the decoder the concealment/recovery 
parameters determined in the encoder; and 

in the decoder, means for conducting erasure frame concealment and 
decoder recovery in response to the received concealment/recovery parameters. 

61. A device as defined in claim 60, further comprising means for 
quantizing, in the encoder, the concealment/recovery parameters prior to 
transmitting said concealment/recovery parameters to the decoder. 

62. A device as defined in claim 60, comprising means for determining, in 
the encoder, concealment/recovery parameters selected from the group 
consisting of: a signal classification parameter, an energy information parameter 
and a phase information parameter. 

63. A device as defined in claim 62, wherein the means for determining 
the phase information parameter comprises means for searching the position of a 
first glottal pulse in every frame of the encoded sound signal. 

30 64. A device as defined in claim 63, wherein the means for determining 

the phase information parameter further comprises means for encoding, in the 
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encoder, the shape, sign and amplitude of the first glottal pulse and means for 
transmitting the encoded shape, sign and amplitude from the encoder to the 
decoder. 

65. A device as defined in claim 63, wherein the means for searching the 
position of the first glottal pulse comprises: 

means for measuring the first glottal pulse as a sample of maximum 
amplitude within a pitch period; and 

means for quantizing the position of the sample of maximum amplitude 
within the pitch period. 

66. A device as defined in claim 60, wherein: 
the sound signal is a speech signal; and 

the means for determining, in the encoder, concealment/recovery 
15 parameters comprises means for classifying successive frames of the encoded 
sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset. 

67. A device as defined in claim 66, wherein the means for classifying the 
successive frames comprises means for classifying as unvoiced every frame
which is an unvoiced frame, every frame without active speech, and every voiced
offset frame having an end tending to be unvoiced. 

68. A device as defined in claim 66, wherein the means for classifying the 
successive frames comprises means for classifying as unvoiced transition every
unvoiced frame having an end with a possible voiced onset which is too short or
not built well enough to be processed as a voiced frame. 

69. A device as defined in claim 66, wherein the means for classifying the 
successive frames comprises means for classifying as voiced transition every
voiced frame with relatively weak voiced characteristics, including voiced frames
with rapidly changing characteristics and voiced offsets lasting the whole frame,
wherein a frame classified as voiced transition follows only frames classified as
voiced transition, voiced or onset. 

70. A device as defined in claim 66, wherein the means for classifying the 
successive frames comprises means for classifying as voiced every voiced
frame with stable characteristics, wherein a frame classified as voiced follows
only frames classified as voiced transition, voiced or onset. 

71. A device as defined in claim 66, wherein the means for classifying the 
successive frames comprises means for classifying as onset every voiced frame
with stable characteristics following a frame classified as unvoiced or unvoiced
transition. 

72. A device as defined in claim 66, comprising means for determining the 
classification of the successive frames of the encoded sound signal on the basis
of at least a part of the following parameters: a normalized correlation parameter,
a spectral tilt parameter, a signal-to-noise ratio parameter, a pitch stability 
parameter, a relative frame energy parameter, and a zero crossing parameter. 

73. A device as defined in claim 72, wherein the means for determining
the classification of the successive frames comprises:

means for computing a figure of merit on the basis of the normalized 
correlation parameter, spectral tilt parameter, signal-to-noise ratio parameter, 
pitch stability parameter, relative frame energy parameter, and zero crossing 
parameter; and

means for comparing the figure of merit to thresholds to determine the 
classification. 
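
By way of a non-limiting sketch, the two means of claim 73 may be pictured as follows in Python. The averaging of parameters already scaled to the range 0 to 1, the four threshold values and all function names are hypothetical placeholders. Only the five class names and the ordering constraints between classes are taken from the claims.

```python
VOICED_LIKE = ("voiced transition", "voiced", "onset")

def figure_of_merit(scaled_params):
    """Average of the classification parameters.

    Assumption: each parameter was already mapped to [0, 1], with larger
    values meaning "more voiced"; the claim does not specify the mapping.
    """
    if not scaled_params:
        raise ValueError("at least one parameter is required")
    return sum(scaled_params) / len(scaled_params)

def classify(fm, previous_class, hi_v=0.7, lo_v=0.5, hi_u=0.65, lo_u=0.55):
    """Compare the figure of merit to thresholds (placeholder values)."""
    if previous_class in VOICED_LIKE:
        # Voiced and voiced transition may only follow these classes.
        if fm >= hi_v:
            return "voiced"
        if fm >= lo_v:
            return "voiced transition"
        return "unvoiced"
    # Previous frame was unvoiced or unvoiced transition.
    if fm > hi_u:
        return "onset"
    if fm > lo_u:
        return "unvoiced transition"
    return "unvoiced"
```

Making the thresholds depend on the previous class is one simple way to enforce that a voiced frame never directly follows an unvoiced one.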

74. A device as defined in claim 72, comprising means for calculating the 
normalized correlation parameter on the basis of a current weighted version of
the speech signal and a past weighted version of said speech signal.

75. A device as defined in claim 72, comprising means for estimating the
spectral tilt parameter as a ratio between an energy concentrated in low 
frequencies and an energy concentrated in high frequencies. 
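
For illustration, a spectral tilt of this kind can be approximated as shown below. The direct DFT, the exclusion of the DC bin and the `split_bin` boundary between "low" and "high" frequencies are assumptions of this sketch and do not come from the claim.

```python
import cmath
import math

def spectral_tilt(frame, split_bin):
    """Ratio of low-band energy to high-band energy of one frame.

    split_bin is a hypothetical boundary: bins 1..split_bin-1 count as
    low frequencies, bins split_bin..N/2 as high frequencies.
    """
    n = len(frame)

    def bin_energy(k):
        # Direct DFT of bin k; fine for a short illustrative frame.
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        return abs(acc) ** 2

    low = sum(bin_energy(k) for k in range(1, split_bin))
    high = sum(bin_energy(k) for k in range(split_bin, n // 2 + 1))
    return low / high if high > 0 else float("inf")
```

A strongly voiced frame concentrates energy in low frequencies and so tends to give a ratio well above one.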

76. A device as defined in claim 72, comprising means for estimating the 
signal-to-noise ratio parameter as a ratio between an energy of a weighted 
version of the speech signal of a current frame and an energy of an error 
between said weighted version of the speech signal of the current frame and a
weighted version of a synthesized speech signal of said current frame.
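
The ratio of claim 76 can be written down directly. In this minimal sketch, the function name and the handling of a zero error energy are assumptions.

```python
def snr_parameter(weighted_speech, weighted_synth):
    """Energy of the weighted speech over energy of the weighted error."""
    if len(weighted_speech) != len(weighted_synth):
        raise ValueError("frames must have the same length")
    e_signal = sum(x * x for x in weighted_speech)
    e_error = sum((x - y) ** 2
                  for x, y in zip(weighted_speech, weighted_synth))
    # Assumption: a perfect synthesis is reported as an infinite ratio.
    return e_signal / e_error if e_error > 0 else float("inf")
```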



77. A device as defined in claim 72, comprising means for computing the 
pitch stability parameter in response to open-loop pitch estimates for a first half of 
a current frame, a second half of the current frame and a look-ahead. 

78. A device as defined in claim 72, comprising means for computing the 
relative frame energy parameter as a difference between an energy of a current 
frame and a long-term average of an energy of active speech frames. 
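
One possible reading of claim 78 is sketched below. The decibel domain, the exponential update of the long-term average, the factor `alpha` and the class name are all assumptions. The claim only requires a difference between the current frame energy and a long-term average over active speech frames.

```python
import math

def frame_energy_db(frame):
    """Mean energy per sample in dB (small offset avoids log of zero)."""
    energy = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)

class RelativeEnergyTracker:
    """Tracks a long-term average energy over active speech frames."""

    def __init__(self, initial_avg_db=0.0, alpha=0.99):
        self.avg_db = initial_avg_db  # hypothetical starting value
        self.alpha = alpha            # hypothetical smoothing factor

    def update(self, frame, active_speech):
        e_db = frame_energy_db(frame)
        # Difference is taken against the average of past frames.
        relative = e_db - self.avg_db
        if active_speech:
            self.avg_db = self.alpha * self.avg_db + (1 - self.alpha) * e_db
        return relative
```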

79. A device as defined in claim 72, comprising means for determining the
zero-crossing parameter as a number of times a sign of the speech signal
changes from a first polarity to a second polarity. 
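
A minimal sketch of such a count follows. It assumes that the first polarity is positive (zero included) and the second polarity is negative; the claim leaves the choice open.

```python
def zero_crossing_count(frame):
    """Count sign changes from non-negative to negative samples only."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a >= 0 and b < 0)
```

Counting in one direction only means that a full oscillation contributes one crossing instead of two.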

80. A device as defined in claim 72, comprising means for computing at 
least one of the normalized correlation parameter, spectral tilt parameter,
signal-to-noise ratio parameter, pitch stability parameter, relative frame energy
parameter, and zero crossing parameter using an available look-ahead to take
into consideration the behavior of the speech signal in the following frame.

81. A device as defined in claim 72, further comprising means for
determining the classification of the successive frames of the encoded sound 
signal also on the basis of a voice activity detection flag. 

82. A device as defined in claim 62, wherein:

the sound signal is a speech signal; 

the means for determining, in the encoder, concealment/recovery 
parameters comprises means for classifying successive frames of the encoded 
sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; 
and

the means for determining concealment/recovery parameters comprises 
means for calculating the energy information parameter in relation to a maximum 
of a signal energy for frames classified as voiced or onset, and means for 
calculating the energy information parameter in relation to an average energy per 
sample for other frames.
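
The class-dependent energy computation of claim 82 might be sketched as follows. Taking the maximum over the squared samples of the whole frame is an assumption made for simplicity, and the function name is hypothetical.

```python
def energy_information(frame, frame_class):
    """Energy information parameter chosen according to the frame class."""
    squared = [x * x for x in frame]
    if frame_class in ("voiced", "onset"):
        # Maximum of the signal energy for voiced or onset frames.
        return max(squared)
    # Average energy per sample for all other classes.
    return sum(squared) / len(squared)
```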

83. A device as defined in claim 60, wherein the means for determining, in 
the encoder, concealment/recovery parameters comprises means for computing 
a voicing information parameter. 

84. A device as defined in claim 83, wherein: 
the sound signal is a speech signal; 

the means for determining, in the encoder, concealment/recovery 
parameters comprises means for classifying successive frames of the encoded 
sound signal;

said device comprises means for determining the classification of the 
successive frames of the encoded sound signal on the basis of a normalized 
correlation parameter; and 

the means for computing the voicing information parameter comprises 
means for estimating said voicing information parameter on the basis of the
normalized correlation.

85. A device as defined in claim 60, wherein the means for conducting
frame erasure concealment and decoder recovery comprises: 

following receiving a non erased unvoiced frame after frame erasure, 
means for generating no periodic part of a LP filter excitation signal;

following receiving, after frame erasure, of a non erased frame other than 
unvoiced, means for constructing a periodic part of the LP filter excitation signal 
by repeating a last pitch period of a previous frame. 
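
The two branches of claim 85 can be illustrated with the short sketch below. The function names and the representation of the excitation as a Python list are assumptions, and `frame_class` stands for the class of the frame that governs the decision in the claim.

```python
def repeat_last_pitch_period(past_excitation, pitch_period, frame_length):
    """Build a periodic signal by cycling the last pitch period."""
    if not 0 < pitch_period <= len(past_excitation):
        raise ValueError("pitch period must fit inside the past excitation")
    last_period = past_excitation[-pitch_period:]
    return [last_period[n % pitch_period] for n in range(frame_length)]

def periodic_part(past_excitation, pitch_period, frame_length, frame_class):
    """Periodic part of the LP filter excitation signal."""
    if frame_class == "unvoiced":
        # No periodic part is generated for an unvoiced frame.
        return [0.0] * frame_length
    return repeat_last_pitch_period(past_excitation, pitch_period, frame_length)
```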

86. A device as defined in claim 85, wherein the means for constructing
the periodic part of the LP filter excitation signal comprises a low-pass filter for
filtering the repeated last pitch period of the previous frame. 

87. A device as defined in claim 86, wherein: 

the means for determining concealment/recovery parameters comprises
means for computing a voicing information parameter;
the low-pass filter has a cut-off frequency; and 

the means for constructing the periodic part of the excitation signal 
comprises means for dynamically adjusting the cut-off frequency in relation to the 
voicing information parameter.

88. A device as defined in claim 60, wherein the means for conducting 
frame erasure concealment and decoder recovery comprises means for randomly 
generating a non-periodic, innovation part of a LP filter excitation signal. 

89. A device as defined in claim 88, wherein the means for randomly 
generating the non-periodic, innovation part of the LP filter excitation signal 
comprises means for generating a random noise. 

90. A device as defined in claim 88, wherein the means for randomly
generating the non-periodic, innovation part of the LP filter excitation signal
comprises means for randomly generating vector indexes of an innovation
codebook. 

91. A device as defined in claim 88, wherein: 
the sound signal is a speech signal;

the means for determining concealment/recovery parameters comprises 
means for classifying successive frames of the encoded sound signal as 
unvoiced, unvoiced transition, voiced transition, voiced, or onset; and 

the means for randomly generating the non-periodic, innovation part of the 
LP filter excitation signal further comprises:

• if the last correctly received frame is different from unvoiced, a 
high-pass filter for filtering the innovation part of the excitation signal; 
and 

• if the last correctly received frame is unvoiced, means for using 
only the innovation part of the excitation signal.
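
A hedged sketch of the two cases of claim 91 is given below. The uniform random noise, the first-order FIR high-pass filter and its coefficient are placeholders chosen for the example, and the optional `rng` argument is there only to make the sketch reproducible.

```python
import random

def high_pass(signal, coeff=0.75):
    """First-order FIR high-pass: y[n] = x[n] - coeff * x[n-1] (assumed)."""
    out, prev = [], 0.0
    for x in signal:
        out.append(x - coeff * prev)
        prev = x
    return out

def random_innovation(length, last_good_class, rng=None):
    """Randomly generated innovation part of the excitation signal."""
    if rng is None:
        rng = random.Random()
    noise = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    if last_good_class != "unvoiced":
        # Last correctly received frame was not unvoiced: high-pass filter.
        return high_pass(noise)
    # Last correctly received frame was unvoiced: use the noise as is.
    return noise
```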

92. A device as defined in claim 60, wherein: 
the sound signal is a speech signal; 

the means for determining, in the encoder, concealment/recovery 
parameters comprises means for classifying successive frames of the encoded
sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; 

the means for conducting frame erasure concealment and decoder 
recovery comprises, when an onset frame is lost which is indicated by the 
presence of a voiced frame following frame erasure and an unvoiced frame 
before frame erasure, means for artificially reconstructing the lost onset by
constructing a periodic part of an excitation signal as a low-pass filtered periodic 
train of pulses separated by a pitch period. 
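
The artificial onset of claim 92 can be pictured with the following sketch. The three-tap causal FIR low-pass filter, the unit pulse amplitude and the `first_pulse` offset are assumptions. A causal filter is used here for brevity, which delays each pulse slightly.

```python
def pulse_train(length, pitch_period, first_pulse=0):
    """Unit pulses separated by one pitch period."""
    return [1.0 if n >= first_pulse and (n - first_pulse) % pitch_period == 0
            else 0.0
            for n in range(length)]

def low_pass(signal, taps=(0.25, 0.5, 0.25)):
    """Causal FIR filtering with hypothetical low-pass taps."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out

def artificial_onset(length, pitch_period, first_pulse=0):
    """Periodic part of the excitation: low-pass filtered pulse train."""
    return low_pass(pulse_train(length, pitch_period, first_pulse))
```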

93. A device as defined in claim 92, wherein the means for conducting 
frame erasure concealment and decoder recovery further comprises means for
constructing an innovation part of the excitation signal by means of normal
decoding. 

94. A device as defined in claim 93, wherein the means for constructing an 
innovation part of the excitation signal comprises means for randomly choosing
entries of an innovation codebook.

95. A device as defined in claim 92, wherein the means for artificially 
reconstructing the lost onset comprises means for limiting a length of the
artificially reconstructed onset so that at least one entire pitch period is
constructed by the onset artificial reconstruction, said reconstruction being 
continued until the end of a current subframe. 

96. A device as defined in claim 95, wherein the means for conducting 
frame erasure concealment and decoder recovery further comprises, after
artificial reconstruction of the lost onset, means for resuming a regular CELP
processing wherein the pitch period is a rounded average of decoded pitch 
periods of all subframes where the artificial onset reconstruction is used. 
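
The rounded average of claim 96 can be sketched in one function. Rounding halves upward is an assumption, since the claim does not state a rounding rule, and the function name is hypothetical.

```python
def resumed_pitch_period(decoded_pitch_periods):
    """Rounded average of the decoded pitch periods of the subframes
    in which the artificial onset reconstruction was used."""
    if not decoded_pitch_periods:
        raise ValueError("at least one subframe is required")
    mean = sum(decoded_pitch_periods) / len(decoded_pitch_periods)
    # Assumption: round half up (Python's round() rounds half to even).
    return int(mean + 0.5)
```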

97. A device as defined in claim 62, wherein the means for conducting
frame erasure concealment and decoder recovery comprises:

means for controlling an energy of a synthesized sound signal produced 
by the decoder, the means for controlling energy of the synthesized sound signal 
comprising means for scaling the synthesized sound signal to render an energy
of said synthesized sound signal at the beginning of a first non erased frame
received following frame erasure similar to an energy of said synthesized signal 
at the end of a last frame erased during said frame erasure; and 

means for converging the energy of the synthesized sound signal in the 
received first non erased frame to an energy corresponding to the received
energy information parameter toward the end of said received first non erased
frame while limiting an increase in energy. 
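
The energy control of claim 97 may be sketched as a gain that starts at a value matching the end of the erasure and converges to a value matching the received energy information. The linear interpolation of the gain, the measurement window `win`, the cap `max_end_gain` used to limit the increase in energy, and all names are assumptions of this sketch.

```python
import math

def mean_energy(samples):
    return sum(x * x for x in samples) / len(samples)

def control_energy(synth, e_prev_end, e_target, win=2, max_end_gain=2.0):
    """Scale the first non erased frame received after an erasure.

    e_prev_end : energy at the end of the last erased frame
    e_target   : energy given by the received energy information parameter
    """
    tiny = 1e-12  # avoids division by zero on silent frames
    # Gain making the start of the frame match the end of the erasure.
    g0 = math.sqrt(e_prev_end / max(mean_energy(synth[:win]), tiny))
    # Gain making the end of the frame match the target, with a cap.
    g1 = min(math.sqrt(e_target / max(mean_energy(synth[-win:]), tiny)),
             max_end_gain)
    last = max(len(synth) - 1, 1)
    # Assumption: the gain moves linearly from g0 to g1 across the frame.
    return [x * (g0 + (g1 - g0) * n / last) for n, x in enumerate(synth)]
```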

98. A device as defined in claim 62, wherein:

the energy information parameter is not transmitted from the encoder to 
the decoder; and 

the means for conducting frame erasure concealment and decoder
recovery comprises, when a gain of a LP filter of a first non erased frame
received following frame erasure is higher than a gain of a LP filter of a last frame 
erased during said frame erasure, means for adjusting the energy of an LP filter 
excitation signal produced in the decoder during the received first non erased 
frame to a gain of the LP filter of said received first non erased frame.



99. A device as defined in claim 98, wherein: 

the means for adjusting the energy of an LP filter excitation signal 
produced in the decoder during the received first non erased frame to a gain of 
the LP filter of said received first non erased frame comprises means for using
the following relation: 

$$E_q = E_1 \frac{E_{LP0}}{E_{LP1}}$$

where $E_1$ is the energy at the end of the current frame, $E_{LP0}$ is the energy of an
impulse response of the LP filter to the last non erased frame received before the
frame erasure, and $E_{LP1}$ is the energy of the impulse response of the LP filter to
the received first non erased frame following frame erasure. 
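
As a sketch under stated assumptions, the quantities of claim 99 can be computed as follows. The code assumes that the relation scales the energy at the end of the current frame by the ratio of the two impulse-response energies. The sign convention A(z) = 1 + a_1 z^-1 + ... for the LP coefficients, the truncation length and the function names are also assumptions.

```python
def lp_impulse_response_energy(a, length=64):
    """Energy of the truncated impulse response of the filter 1/A(z),
    with A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ... (assumed convention)."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * h[n - k]
        h.append(acc)
    return sum(x * x for x in h)

def adjusted_energy(e1, e_lp0, e_lp1):
    """Energy at the end of the current frame scaled by E_LP0 / E_LP1."""
    return e1 * e_lp0 / e_lp1
```

When the LP filter of the first non erased frame has the larger gain, the ratio is below one and the excitation energy is reduced accordingly.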

100. A device as defined in claim 97, wherein:

the sound signal is a speech signal; 

the means for determining, in the encoder, concealment/recovery 
parameters comprises means for classifying successive frames of the encoded 
sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; 
and

when the first non erased frame received after a frame erasure is
classified as ONSET, the means for conducting frame erasure concealment and 
decoder recovery comprises means for limiting to a given value a gain used for 
scaling the synthesized sound signal. 

101. A device as defined in claim 97, wherein: 
the sound signal is a speech signal; 

the means for determining, in the encoder, concealment/recovery 
parameters comprises means for classifying successive frames of the encoded 
sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset;
and 

said device comprising means for making a gain used for scaling the 
synthesized sound signal at the beginning of the first non erased frame received 
after frame erasure equal to a gain used at the end of said received first non 
erased frame:

• during a transition from a voiced frame to an unvoiced frame, in
the case of a last non erased frame received before frame erasure
classified as voiced transition, voiced or onset and a first non erased
frame received after frame erasure classified as unvoiced; and

• during a transition from a non-active speech period to an active
speech period, when the last non erased frame received before frame
erasure is encoded as comfort noise and the first non erased frame
received after frame erasure is encoded as active speech.

102. A device for the concealment of frame erasure caused by frames
erased during transmission of a sound signal encoded under the form of
signal-encoding parameters from an encoder to a decoder, and for accelerating
recovery of the decoder after non erased frames of the encoded sound signal 
have been received, comprising: 

means for determining, in the decoder, concealment/recovery parameters
from the signal-encoding parameters;

in the decoder, means for conducting erased frame concealment and
decoder recovery in response to the determined concealment/recovery 
parameters. 

103. A device as defined in claim 102, comprising means for determining,
in the decoder, concealment/recovery parameters selected from the group
consisting of: a signal classification parameter, an energy information parameter 
and a phase information parameter. 

104. A device as defined in claim 102, wherein:

the sound signal is a speech signal; and 

the means for determining, in the decoder, concealment/recovery 
parameters comprises means for classifying successive frames of the encoded 
sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset. 

105. A device as defined in claim 102, wherein the means for determining, 
in the decoder, concealment/recovery parameters comprises means for 
computing a voicing information parameter. 

106. A device as defined in claim 102, wherein the means for conducting
frame erasure concealment and decoder recovery comprises:

following receiving a non erased unvoiced frame after frame erasure, 
means for generating no periodic part of a LP filter excitation signal; 

following receiving, after frame erasure, of a non erased frame other than 
unvoiced, means for constructing a periodic part of the LP filter excitation signal
by repeating a last pitch period of a previous frame. 

107. A device as defined in claim 106, wherein the means for constructing
the periodic part of the excitation signal comprises a low-pass filter for filtering the
repeated last pitch period of the previous frame.

108. A device as defined in claim 107, wherein:

the means for determining, in the decoder, concealment/recovery 
parameters comprises means for computing a voicing information parameter; 

the low-pass filter has a cut-off frequency; and 
the means for constructing the periodic part of the LP filter excitation
signal comprises means for dynamically adjusting the cut-off frequency in relation
to the voicing information parameter. 

109. A device as defined in claim 102, wherein the means for conducting 
frame erasure concealment and decoder recovery comprises means for randomly
generating a non-periodic, innovation part of a LP filter excitation signal.

110. A device as defined in claim 109, wherein the means for randomly 
generating the non-periodic, innovation part of the LP filter excitation signal
comprises means for generating a random noise.

111. A device as defined in claim 109, wherein the means for randomly
generating the non-periodic, innovation part of the LP filter excitation signal
comprises means for randomly generating vector indexes of an innovation
codebook.

112. A device as defined in claim 109, wherein:
the sound signal is a speech signal; 

the means for determining, in the decoder, concealment/recovery
parameters comprises means for classifying successive frames of the encoded
sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; 
and 

the means for randomly generating the non-periodic, innovation part of the 
LP filter excitation signal further comprises: 

• if the last received non erased frame is different from unvoiced,
a high-pass filter for filtering the innovation part of the LP filter 
excitation signal; and 

• if the last received non erased frame is unvoiced, means for 
using only the innovation part of the LP filter excitation signal.

113. A device as defined in claim 109, wherein: 
the sound signal is a speech signal; 

the means for determining, in the decoder, concealment/recovery 
parameters comprises means for classifying successive frames of the encoded
sound signal as unvoiced, unvoiced transition, voiced transition, voiced, or onset; 

the means for conducting frame erasure concealment and decoder 
recovery comprises, when an onset frame is lost which is indicated by the 
presence of a voiced frame following frame erasure and an unvoiced frame 
before frame erasure, means for artificially reconstructing the lost onset by
constructing a periodic part of an excitation signal as a low-pass filtered periodic 
train of pulses separated by a pitch period. 

114. A device as defined in claim 113, wherein the means for conducting 
frame erasure concealment and decoder recovery further comprises means for
constructing an innovation part of the LP filter excitation signal by means of
normal decoding. 

115. A device as defined in claim 114, wherein the means for constructing 
an innovation part of the LP filter excitation signal comprises means for randomly
choosing entries of an innovation codebook.

116. A device as defined in claim 113, wherein the means for artificially 
reconstructing the lost onset comprises means for limiting a length of the
artificially reconstructed onset so that at least one entire pitch period is
constructed by the onset artificial reconstruction, said reconstruction being
continued until the end of a current subframe. 

117. A device as defined in claim 116, wherein the means for conducting 
frame erasure concealment and decoder recovery further comprises, after
artificial reconstruction of the lost onset, means for resuming a regular CELP 
processing wherein the pitch period is a rounded average of decoded pitch 
periods of all subframes where the artificial onset reconstruction is used. 

118. A device as defined in claim 103, wherein:

the energy information parameter is not transmitted from the encoder to 
the decoder; and 

the means for conducting frame erasure concealment and decoder 
recovery comprises, when a gain of a LP filter of a first non erased frame 
received following frame erasure is higher than a gain of a LP filter of a last frame
erased during said frame erasure, means for adjusting the energy of an LP filter 
excitation signal produced in the decoder during the received first non erased 
frame to a gain of the LP filter of said received first non erased frame using the 
following relation: 

$$E_q = E_1 \frac{E_{LP0}}{E_{LP1}}$$

where $E_1$ is the energy at the end of the current frame, $E_{LP0}$ is the energy of an
impulse response of the LP filter to the last non erased frame received before the
frame erasure, and $E_{LP1}$ is the energy of the impulse response of the LP filter to
the received first non erased frame following frame erasure. 

119. A system for encoding and decoding a sound signal, comprising: 
a sound signal encoder responsive to the sound signal for producing a set 
of signal-encoding parameters;

means for transmitting the signal-encoding parameters to a decoder;

said decoder for synthesizing the sound signal in response to the signal- 
encoding parameters; and 

a device as recited in any one of claims 60 to 101, for improving 
concealment of frame erasure caused by frames of the encoded sound signal 
erased during transmission from the encoder to the decoder, and for accelerating 
recovery of the decoder after non erased frames of the encoded sound signal 
have been received. 

120. A decoder for decoding an encoded sound signal comprising: 

means responsive to the encoded sound signal for recovering from said 
encoded sound signal a set of signal-encoding parameters; 

means for synthesizing the sound signal in response to the signal- 
encoding parameters; and 

a device as recited in any one of claims 102 to 118, for improving 
concealment of frame erasure caused by frames of the encoded sound signal 
erased during transmission from an encoder to the decoder, and for accelerating 
recovery of the decoder after non erased frames of the encoded sound signal 
have been received. 